[00:05:11] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational
[00:06:48] (PS1) Dzahn: icinga: remove 'system/process command access' for everyone [puppet] - https://gerrit.wikimedia.org/r/506579
[00:10:45] (CR) Dzahn: [C: +2] "absented in admin module but this is another place where things can be (for offboarding workflow)" [puppet] - https://gerrit.wikimedia.org/r/506578 (owner: Dzahn)
[00:11:23] PROBLEM - HP RAID on ms-be2038 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.82: Connection reset by peer
[00:13:51] Operations, DC-Ops, SRE-Access-Requests, Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (Dzahn) a:Dzahn
[00:18:26] (CR) CDanis: [C: +1] raid: add Icinga notes_urls [puppet] - https://gerrit.wikimedia.org/r/506548 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn)
[00:18:53] (CR) CDanis: [C: +1] icinga/nagios_common: add Willy Pao to group misleadingly called 'sms' [puppet] - https://gerrit.wikimedia.org/r/506571 (https://phabricator.wikimedia.org/T221142) (owner: Dzahn)
[00:26:44] PROBLEM - swift-object-auditor on ms-be2035 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.165: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:26:46] PROBLEM - Disk space on ms-be2035 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.165: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[00:26:50] PROBLEM - swift-object-server on ms-be2035 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.165: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:26:50] PROBLEM - Check size of conntrack table on ms-be2035 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.165: Connection reset by peer
[00:26:52] PROBLEM - swift-container-auditor on ms-be2035 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.165: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:26:58] PROBLEM - very high load average likely xfs on ms-be2035 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.165: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:27:02] PROBLEM - swift-account-replicator on ms-be2035 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.165: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:27:04] PROBLEM - configured eth on ms-be2035 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.165: Connection reset by peer
[00:27:24] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2035 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.165: Connection reset by peer
[00:27:30] PROBLEM - swift-object-updater on ms-be2035 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.165: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[00:27:44] RECOVERY - swift-object-auditor on ms-be2035 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor https://wikitech.wikimedia.org/wiki/Swift
[00:27:46] RECOVERY - Disk space on ms-be2035 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[00:27:48] RECOVERY - swift-object-server on ms-be2035 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server https://wikitech.wikimedia.org/wiki/Swift
[00:27:50] RECOVERY - Check size of conntrack table on ms-be2035 is OK: OK: nf_conntrack is 3 % full
[00:27:54] RECOVERY - swift-container-auditor on ms-be2035 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor https://wikitech.wikimedia.org/wiki/Swift
[00:27:58] RECOVERY - very high load average likely xfs on ms-be2035 is OK: OK - load average: 36.66, 43.04, 43.50 https://wikitech.wikimedia.org/wiki/Swift
[00:28:00] RECOVERY - swift-account-replicator on ms-be2035 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator https://wikitech.wikimedia.org/wiki/Swift
[00:28:03] wowee
[00:28:04] RECOVERY - configured eth on ms-be2035 is OK: OK - interfaces up
[00:28:24] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2035 is OK: OK ferm input default policy is set
[00:28:30] RECOVERY - swift-object-updater on ms-be2035 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater https://wikitech.wikimedia.org/wiki/Swift
[00:42:02] (PS2) Dzahn: lvs: add runbook for check_rp_filter_disabled [puppet] - https://gerrit.wikimedia.org/r/506549
[00:44:50] RECOVERY - Check systemd state on ms-be2034 is OK: OK - running: The system is fully operational
[01:31:20] yeah that's about what it was doing yesterday
[01:31:27] :/
[01:56:18] PROBLEM - Check systemd state on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer
[01:57:30] RECOVERY - Check systemd state on ms-be2034 is OK: OK - running: The system is fully operational
[02:42:19] (PS1) Andrew Bogott: Allow puppet-merge to merge the labs/private repo [puppet] - https://gerrit.wikimedia.org/r/506582 (https://phabricator.wikimedia.org/T221888)
[02:44:22] (PS2) Andrew Bogott: Allow puppet-merge to merge the labs/private repo [puppet] - https://gerrit.wikimedia.org/r/506582 (https://phabricator.wikimedia.org/T221888)
[04:24:18] PROBLEM - Mediawiki Cirrussearch update lag - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[04:25:00] PROBLEM - Mediawiki Cirrussearch update lag - codfw on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[04:26:40] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:41:02] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:41:56] RECOVERY - Mediawiki Cirrussearch update lag - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[04:42:30] RECOVERY - Mediawiki Cirrussearch update lag - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[04:46:03] Operations, Epic, Patch-For-Review, cloud-services-team (Kanban): relocate/reimage cloudvirt1006 with 10G interfaces - https://phabricator.wikimedia.org/T221048 (Andrew)
[04:46:12] Operations, Epic, Patch-For-Review, cloud-services-team (Kanban): relocate/reimage cloudvirt1005 with 10G interfaces - https://phabricator.wikimedia.org/T221049 (Andrew)
[04:46:21] Operations, ops-eqiad, DC-Ops, Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (Andrew)
[04:46:30] Operations, Epic, Patch-For-Review, cloud-services-team (Kanban): relocate/reimage cloudvirt1006 with 10G interfaces - https://phabricator.wikimedia.org/T221048 (Andrew) Open→Resolved
[04:46:37] Operations, ops-eqiad, DC-Ops, Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (Andrew)
[04:46:47] Operations, Epic, Patch-For-Review, cloud-services-team (Kanban): relocate/reimage cloudvirt1005 with 10G interfaces - https://phabricator.wikimedia.org/T221049 (Andrew) Open→Resolved
[04:53:47] (PS3) Marostegui: site.pp: Remove pc1004-pc1006 [puppet] - https://gerrit.wikimedia.org/r/506415 (https://phabricator.wikimedia.org/T210969)
[04:54:00] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/506586
[04:56:12] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/506586 (owner: Marostegui)
[04:57:10] PROBLEM - puppet last run on db1102 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:57:14] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/506586 (owner: Marostegui)
[04:58:21] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1113:3316 T221782 (duration: 00m 56s)
[04:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:58:28] T221782: Fix revision special indexes and partitions on db1103:3314 and db1113:3316 - https://phabricator.wikimedia.org/T221782
[04:59:29] (CR) Marostegui: [C: +2] site.pp: Remove pc1004-pc1006 [puppet] - https://gerrit.wikimedia.org/r/506415 (https://phabricator.wikimedia.org/T210969) (owner: Marostegui)
[05:00:53] Operations, ops-eqiad, DBA, decommission, Patch-For-Review: Decommission parsercache hosts: pc1004.eqiad.wmnet pc1005.eqiad.wmnet pc1006.eqiad.wmnet - https://phabricator.wikimedia.org/T210969 (Marostegui) @RobH @Cmjohnson I have removed the spare role entries, so only pending the DNS entries...
[05:01:01] Operations, ops-eqiad, DBA, decommission, Patch-For-Review: Decommission parsercache hosts: pc1004.eqiad.wmnet pc1005.eqiad.wmnet pc1006.eqiad.wmnet - https://phabricator.wikimedia.org/T210969 (Marostegui) a:Cmjohnson→RobH
[05:04:40] PROBLEM - puppet last run on wtp1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:06:05] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/506586 (owner: Marostegui)
[05:07:04] (PS1) BryanDavis: wikitech: Disable Gerrit accounts when blocked on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/506587 (https://phabricator.wikimedia.org/T218654)
[05:21:41] (PS1) BryanDavis: wikitech: Provision gerrit api auth credentials [puppet] - https://gerrit.wikimedia.org/r/506588 (https://phabricator.wikimedia.org/T218654)
[05:23:38] RECOVERY - puppet last run on db1102 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[05:34:46] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:36:24] RECOVERY - puppet last run on wtp1046 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:01:14] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:22:25] (PS3) Elukey: profile::analytics::database::meta: add properties to my.cnf [puppet] - https://gerrit.wikimedia.org/r/506179 (https://phabricator.wikimedia.org/T212243)
[06:23:05] (PS1) Marostegui: db*.yaml: Add hostname [puppet] - https://gerrit.wikimedia.org/r/506589
[06:25:54] (CR) Elukey: [C: +2] profile::analytics::database::meta: add properties to my.cnf [puppet] - https://gerrit.wikimedia.org/r/506179 (https://phabricator.wikimedia.org/T212243) (owner: Elukey)
[06:30:46] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/bash_autologout.sh]
[06:30:47] (PS2) Marostegui: db*.yaml: Add hostname [puppet] - https://gerrit.wikimedia.org/r/506589
[06:32:50] RECOVERY - HP RAID on ms-be2038 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK
[06:33:11] (PS3) Marostegui: db*.yaml: Add hostname as a comment [puppet] - https://gerrit.wikimedia.org/r/506589
[06:38:54] (CR) Muehlenhoff: [C: +1] "Makes sense, but would like to get confirmation by Chad before merging." [puppet] - https://gerrit.wikimedia.org/r/506542 (https://phabricator.wikimedia.org/T220860) (owner: Dzahn)
[06:39:34] PROBLEM - puppet last run on ms-be1035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/swift/swift-drive-audit.conf]
[06:41:22] RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:50:10] PROBLEM - puppet last run on ms-be1035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 17 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/swift/swift-drive-audit.conf]
[06:59:35] Operations, ops-eqiad: Degraded RAID on labcontrol1001 - https://phabricator.wikimedia.org/T221911 (MoritzMuehlenhoff) Open→Invalid Seems like fallout of the decom steps which happened in T221817.
[06:59:43] Operations, ops-eqiad: Degraded RAID on labcontrol1001 - https://phabricator.wikimedia.org/T221910 (MoritzMuehlenhoff) Open→Invalid Seems like fallout of the decom steps which happened in T221817.
[06:59:50] Operations, ops-eqiad: Degraded RAID on labcontrol1002 - https://phabricator.wikimedia.org/T221909 (MoritzMuehlenhoff) Open→Invalid Seems like fallout of the decom steps which happened in T221817.
[06:59:55] Operations, ops-eqiad: Degraded RAID on labcontrol1002 - https://phabricator.wikimedia.org/T221912 (MoritzMuehlenhoff) Open→Invalid Seems like fallout of the decom steps which happened in T221817.
[07:05:58] RECOVERY - puppet last run on ms-be1035 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:09:36] Operations, Analytics-Kanban, EventBus, netops, Patch-For-Review: Allow analytics VLAN to reach schema.svc.$site.wmnet - https://phabricator.wikimedia.org/T221690 (elukey) Open→Resolved To keep archives happy: ` term schema { from { destination-address { /* sc...
[07:20:07] !log installing glibc updates on a number of analytics hosts
[07:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:34] (CR) Marostegui: "This is a noop as expected: https://puppet-compiler.wmflabs.org/compiler1002/16067/" [puppet] - https://gerrit.wikimedia.org/r/506589 (owner: Marostegui)
[07:41:16] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[07:42:32] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.896 second response time https://phabricator.wikimedia.org/T174916
[07:43:29] (PS1) Muehlenhoff: Add sd-pam processes to filter list for debdeploy [puppet] - https://gerrit.wikimedia.org/r/506595 (https://phabricator.wikimedia.org/T135991)
[07:46:30] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[07:59:45] Operations: Please import php-xdebug to apt.wm.o component/php72 - https://phabricator.wikimedia.org/T221923 (Mainframe98)
[08:03:35] Operations: Please import php-xdebug to apt.wm.o component/php72 - https://phabricator.wikimedia.org/T221923 (Mainframe98)
[08:21:42] !log uploaded php-xdebug 2.7.0+wmf1 for component/php72 (T221923)
[08:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:49] T221923: Please import php-xdebug to apt.wm.o component/php72 - https://phabricator.wikimedia.org/T221923
[08:22:22] Operations: Please import php-xdebug to apt.wm.o component/php72 - https://phabricator.wikimedia.org/T221923 (MoritzMuehlenhoff) @Mainframe98 : I've uploaded a build of xdebug 2.7.0 for PHP 7.2, let me know if it works for you
[08:24:08] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.233 second response time https://phabricator.wikimedia.org/T174916
[08:27:28] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[08:29:36] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.457 second response time https://phabricator.wikimedia.org/T174916
[08:33:12] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[08:41:42] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.049 second response time https://phabricator.wikimedia.org/T174916
[08:42:33] !log restart pdfrender on scb1003 (alert flapping)
[08:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:27] (PS1) Ema: package_builder: install lintian from backports [puppet] - https://gerrit.wikimedia.org/r/506598
[08:45:30] (CR) Ema: "pcc looks good: https://puppet-compiler.wmflabs.org/compiler1002/16068/" [puppet] - https://gerrit.wikimedia.org/r/506598 (owner: Ema)
[08:48:09] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/506598 (owner: Ema)
[08:49:36] (CR) Jcrespo: "Ok with this. I have plans to remove most of this after the role refactoring." [puppet] - https://gerrit.wikimedia.org/r/506589 (owner: Marostegui)
[08:51:46] (CR) Marostegui: "> Ok with this. I have plans to remove most of this after the role" [puppet] - https://gerrit.wikimedia.org/r/506589 (owner: Marostegui)
[08:53:41] (PS3) Ema: conftool-data: set cp4021 as the only ats-be in production [puppet] - https://gerrit.wikimedia.org/r/506445 (https://phabricator.wikimedia.org/T219967)
[08:55:00] (CR) Jcrespo: [C: +1] db*.yaml: Add hostname as a comment [puppet] - https://gerrit.wikimedia.org/r/506589 (owner: Marostegui)
[08:55:39] (CR) Ema: [C: +2] conftool-data: set cp4021 as the only ats-be in production [puppet] - https://gerrit.wikimedia.org/r/506445 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[09:02:24] (PS4) Marostegui: db*.yaml: Add hostname as a comment [puppet] - https://gerrit.wikimedia.org/r/506589
[09:04:24] (CR) Marostegui: [C: +2] db*.yaml: Add hostname as a comment [puppet] - https://gerrit.wikimedia.org/r/506589 (owner: Marostegui)
[09:11:34] !log restarting AQS on aqs1004 for glibc update
[09:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:07] win 25
[09:23:59] Operations, Puppet, Packaging, Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (MoritzMuehlenhoff) One thing that will need to be fixed is the detection of HP machines to install 'hp-health' in modules/base/manifests/standard_packages.pp:L1...
[09:31:18] (PS2) Ema: cache: multiple keyspaces support for directors.frontend.vcl [puppet] - https://gerrit.wikimedia.org/r/506480 (https://phabricator.wikimedia.org/T219967)
[09:36:28] (CR) Filippo Giunchedi: [C: +1] "LGTM! Will indeed make it easier to experiment" [puppet] - https://gerrit.wikimedia.org/r/506321 (owner: CDanis)
[09:36:51] (CR) Filippo Giunchedi: [C: +1] package_builder: install lintian from backports [puppet] - https://gerrit.wikimedia.org/r/506598 (owner: Ema)
[09:38:33] (PS1) Elukey: profile::analytics::refinery::repository: use the 'analitics-deploy' user [puppet] - https://gerrit.wikimedia.org/r/506609 (https://phabricator.wikimedia.org/T220971)
[09:39:40] (PS15) Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594)
[09:42:39] (CR) Ema: "pcc looks good now. The issue was the comment inside <% end %>. https://puppet-compiler.wmflabs.org/compiler1002/16069/" [puppet] - https://gerrit.wikimedia.org/r/506480 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[09:42:53] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/506579 (owner: Dzahn)
[09:43:15] (PS1) Mathew.onipe: remote: add depooled context mgr [software/spicerack] - https://gerrit.wikimedia.org/r/506610
[09:43:44] (CR) Ema: [C: +2] cache: multiple keyspaces support for directors.frontend.vcl [puppet] - https://gerrit.wikimedia.org/r/506480 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[09:47:00] (CR) jerkins-bot: [V: -1] remote: add depooled context mgr [software/spicerack] - https://gerrit.wikimedia.org/r/506610 (owner: Mathew.onipe)
[09:47:54] (CR) Joal: "Yes! /me loves stories" [puppet] - https://gerrit.wikimedia.org/r/506609 (https://phabricator.wikimedia.org/T220971) (owner: Elukey)
[09:48:56] !log Remove labtestservices2001 from tendril - T218022
[09:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:03] T218022: decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022
[09:50:35] (PS2) Elukey: profile::analytics::refinery::repository: use the 'analitics-deploy' user [puppet] - https://gerrit.wikimedia.org/r/506609 (https://phabricator.wikimedia.org/T220971)
[09:50:53] Operations, Wikidata, Wikidata-Query-Service, Epic: [Epic] Scaling strategy for Wikidata Query Service - https://phabricator.wikimedia.org/T221938 (Gehel)
[09:50:57] (CR) Jbond: package_builder: install lintian from backports (1 comment) [puppet] - https://gerrit.wikimedia.org/r/506598 (owner: Ema)
[09:53:26] (PS16) Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594)
[09:54:40] (CR) Muehlenhoff: [C: +1] package_builder: install lintian from backports (1 comment) [puppet] - https://gerrit.wikimedia.org/r/506598 (owner: Ema)
[09:56:09] (PS1) Jbond: facter3/puppet5: upgrade canary-bastion host [puppet] - https://gerrit.wikimedia.org/r/506613 (https://phabricator.wikimedia.org/T219803)
[09:57:07] (CR) Jbond: [C: +1] package_builder: install lintian from backports (1 comment) [puppet] - https://gerrit.wikimedia.org/r/506598 (owner: Ema)
[09:57:33] (CR) Jbond: [C: +2] facter3/puppet5: upgrade canary-bastion host [puppet] - https://gerrit.wikimedia.org/r/506613 (https://phabricator.wikimedia.org/T219803) (owner: Jbond)
[09:57:43] (PS2) Jbond: facter3/puppet5: upgrade canary-bastion host [puppet] - https://gerrit.wikimedia.org/r/506613 (https://phabricator.wikimedia.org/T219803)
[09:58:38] (PS5) Mathew.onipe: Add postgres slave init cookbook [cookbooks] - https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946)
[09:59:56] (PS1) Arturo Borrero Gonzalez: standard: refactor into a profile [puppet] - https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225)
[10:00:49] (CR) jerkins-bot: [V: -1] standard: refactor into a profile [puppet] - https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225) (owner: Arturo Borrero Gonzalez)
[10:03:44] (PS3) Elukey: profile::analytics::refinery::repository: use the 'analitics-deploy' user [puppet] - https://gerrit.wikimedia.org/r/506609 (https://phabricator.wikimedia.org/T220971)
[10:04:44] (PS2) Arturo Borrero Gonzalez: standard: refactor into a profile [puppet] - https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225)
[10:05:20] (CR) jerkins-bot: [V: -1] standard: refactor into a profile [puppet] - https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225) (owner: Arturo Borrero Gonzalez)
[10:05:38] (PS3) Arturo Borrero Gonzalez: standard: refactor into a profile [puppet] - https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225)
[10:06:21] (CR) jerkins-bot: [V: -1] standard: refactor into a profile [puppet] - https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225) (owner: Arturo Borrero Gonzalez)
[10:07:47] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/16074/" [puppet] - https://gerrit.wikimedia.org/r/506609 (https://phabricator.wikimedia.org/T220971) (owner: Elukey)
[10:09:34] (PS1) Ema: conftool-data: define ats-be for text/upload in all DCs [puppet] - https://gerrit.wikimedia.org/r/506624 (https://phabricator.wikimedia.org/T219967)
[10:13:03] Operations: Investigate use of hp-asrd on HPE servers - https://phabricator.wikimedia.org/T221939 (MoritzMuehlenhoff)
[10:13:50] (CR) Vgutierrez: [C: +1] "my OCD is becoming a CDO, but LGTM" [puppet] - https://gerrit.wikimedia.org/r/506624 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[10:13:59] (CR) Ema: [C: +2] conftool-data: define ats-be for text/upload in all DCs [puppet] - https://gerrit.wikimedia.org/r/506624 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[10:14:11] (PS1) Gilles: Fix coal syslog logging [puppet] - https://gerrit.wikimedia.org/r/506626 (https://phabricator.wikimedia.org/T221401)
[10:18:42] PROBLEM - Check systemd state on analytics1052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:18:50] (PS1) Jbond: facter3/puppet5: add version glue back [puppet] - https://gerrit.wikimedia.org/r/506628 (https://phabricator.wikimedia.org/T219803)
[10:19:09] !log depool cp3030 for testing T219967
[10:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:14] T219967: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967
[10:23:55] (PS1) Ema: Revert "conftool-data: define ats-be for text/upload in all DCs" [puppet] - https://gerrit.wikimedia.org/r/506633
[10:24:14] PROBLEM - puppet last run on mw1325 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:24:23] (CR) jerkins-bot: [V: -1] Revert "conftool-data: define ats-be for text/upload in all DCs" [puppet] - https://gerrit.wikimedia.org/r/506633 (owner: Ema)
[10:24:57] (PS1) Jbond: ulogd: rename nflog comment [puppet] - https://gerrit.wikimedia.org/r/506634 (https://phabricator.wikimedia.org/T116011)
[10:24:59] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/506628 (https://phabricator.wikimedia.org/T219803) (owner: Jbond)
[10:25:22] (PS2) Ema: Revert "conftool-data: define ats-be for text/upload in all DCs" [puppet] - https://gerrit.wikimedia.org/r/506633 (https://phabricator.wikimedia.org/T219967)
[10:25:57] Operations, Patch-For-Review: ferm: Log dropped packets - https://phabricator.wikimedia.org/T116011 (jbond) >Looking at cumin1001 I noticed that the log prefix at the end of the input chan is "fw-out-drop" and the output chain is empty with an accept policy. Is "out" indeed the direction in this case? Or...
[10:27:47] (CR) Ema: [C: +2] Revert "conftool-data: define ats-be for text/upload in all DCs" [puppet] - https://gerrit.wikimedia.org/r/506633 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[10:28:42] !log restarting Parsoid on wtp1025 for glibc update
[10:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:51] (PS1) Ema: cache: hiera setting to list backend services [puppet] - https://gerrit.wikimedia.org/r/506636 (https://phabricator.wikimedia.org/T219967)
[10:39:22] (CR) jerkins-bot: [V: -1] cache: hiera setting to list backend services [puppet] - https://gerrit.wikimedia.org/r/506636 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[10:41:26] (CR) Gilles: "Fix tested on WMCS" [puppet] - https://gerrit.wikimedia.org/r/506626 (https://phabricator.wikimedia.org/T221401) (owner: Gilles)
[10:44:01] (PS7) Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225)
[10:44:40] (CR) jerkins-bot: [V: -1] sudo: decouple sudo from sudo-ldap [puppet] - https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) (owner: Arturo Borrero Gonzalez)
[10:46:38] (PS2) Ema: cache: hiera setting to list backend services [puppet] - https://gerrit.wikimedia.org/r/506636 (https://phabricator.wikimedia.org/T219967)
[10:48:18] (Abandoned) Ema: cache: do not set backend_service [puppet] - https://gerrit.wikimedia.org/r/506484 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[10:56:00] RECOVERY - puppet last run on mw1325 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[10:59:22] (PS4) Arturo Borrero Gonzalez: standard: introduce a wrapper profile [puppet] - https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225)
[11:00:47] (CR) Jbond: [C: +2] facter3/puppet5: add version glue back [puppet] - https://gerrit.wikimedia.org/r/506628 (https://phabricator.wikimedia.org/T219803) (owner: Jbond)
[11:00:53] (PS2) Jbond: facter3/puppet5: add version glue back [puppet] - https://gerrit.wikimedia.org/r/506628 (https://phabricator.wikimedia.org/T219803)
[11:00:55] (PS1) Muehlenhoff: Make role::analytics_test_cluster::coordinator a client of our Kerberos realm [puppet] - https://gerrit.wikimedia.org/r/506638
[11:02:51] (PS1) Matthias Mullie: Rename UploadWizard depicts/statements toggle [mediawiki-config] - https://gerrit.wikimedia.org/r/506639
[11:06:35] (CR) Elukey: [C: +1] Make role::analytics_test_cluster::coordinator a client of our Kerberos realm [puppet] - https://gerrit.wikimedia.org/r/506638 (owner: Muehlenhoff)
[11:08:32] (PS5) Arturo Borrero Gonzalez: standard: introduce a wrapper profile [puppet] - https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225)
[11:09:20] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225) (owner: Arturo Borrero Gonzalez)
[11:12:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:16:03] (PS32) Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217)
[11:16:05] (PS19) Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594)
[11:16:07] (PS12) Vgutierrez: trafficserver: Allow disabling caching requests [puppet] - https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594)
[11:16:09] (PS17) Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594)
[11:16:49] !log upgrade puppet 4=> 5 and facter 2 => 3 on bast4002, aqs1004 and conf2001
[11:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:27] (PS1) Jbond: facter3/puppet5: upgrade canaries [puppet] - https://gerrit.wikimedia.org/r/506643 (https://phabricator.wikimedia.org/T219803)
[11:17:58] (CR) Jbond: [C: +2] facter3/puppet5: upgrade canaries [puppet] - https://gerrit.wikimedia.org/r/506643 (https://phabricator.wikimedia.org/T219803) (owner: Jbond)
[11:18:11] (PS6) Arturo Borrero Gonzalez: standard: introduce a wrapper profile [puppet] - https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225)
[11:19:27] Operations, Analytics, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (elukey)
[11:21:06] (PS2) Muehlenhoff: Make role::analytics_test_cluster::coordinator a client of our Kerberos realm [puppet] - https://gerrit.wikimedia.org/r/506638
[11:21:44] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:24:41] (CR) Muehlenhoff: [C: +2] Make role::analytics_test_cluster::coordinator a client of our Kerberos realm [puppet] - https://gerrit.wikimedia.org/r/506638 (owner: Muehlenhoff)
[11:25:13] (PS7) Arturo Borrero Gonzalez: standard: introduce a wrapper profile [puppet] - https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225)
[11:26:56] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:26:56] !log upgrade puppet 4=> 5 and facter 2 => 3 on lvs4007, dns2001 and multatuli
[11:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:00] Operations, ops-eqiad, Analytics, hardware-requests, and 2 others: Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible) - https://phabricator.wikimedia.org/T220700 (elukey) Summary of what would be needed: * kafka-jumbo1001 (A1) -> 10G card + relocation to a 10G rack * kafka-jumbo1002 (A2) -> 1...
[11:27:44] Operations, ops-eqiad, Analytics, User-Elukey: Check if a GPU fits in any of the remaining stat or notebook hosts - https://phabricator.wikimedia.org/T220698 (elukey) a:Cmjohnson
[11:28:54] (PS1) Jbond: facter3/puppet5: upgrade canaries [puppet] - https://gerrit.wikimedia.org/r/506646 (https://phabricator.wikimedia.org/T219803)
[11:30:02] (CR) Jbond: [C: +2] facter3/puppet5: upgrade canaries [puppet] - https://gerrit.wikimedia.org/r/506646 (https://phabricator.wikimedia.org/T219803) (owner: Jbond)
[11:30:32] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[11:31:44] (PS1) Muehlenhoff: Move Kerberos Hiera settings to global setting [puppet] - https://gerrit.wikimedia.org/r/506647
[11:33:49] (CR) Elukey: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/506647 (owner: Muehlenhoff)
[11:34:12] (PS18) Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594)
[11:34:26] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[11:34:38] PS32 haha
[11:35:25] yah.. that CR already has kids
[11:35:50] and trafficserver: Provide a TLS terminator profile and backend+TLS role can get some beers in the EU
[11:36:20] managing a chain of CRs triggers this stuff
[11:36:40] jouncebot: now
[11:36:40] No deployments scheduled for the next 70 hour(s) and 53 minute(s)
[11:37:00] No deployments this week due to easter?
[11:38:02] (03PS33) 10Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - 10https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) [11:38:04] (03PS20) 10Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) [11:38:06] (03PS13) 10Vgutierrez: trafficserver: Allow disabling caching requests [puppet] - 10https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594) [11:38:08] (03PS19) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [11:43:53] (03PS34) 10Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - 10https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) [11:43:55] (03PS21) 10Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) [11:43:58] (03PS14) 10Vgutierrez: trafficserver: Allow disabling caching requests [puppet] - 10https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594) [11:44:11] (03PS20) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [11:52:57] (03CR) 10Arturo Borrero Gonzalez: "PCC as expected: https://puppet-compiler.wmflabs.org/compiler1002/16084/" [puppet] - 10https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [11:54:10] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:57:01] (03PS1) 10Jbond: facter3/puppet5: downgrade canaries [puppet] - 10https://gerrit.wikimedia.org/r/506650 (https://phabricator.wikimedia.org/T219803) [11:57:03] (03PS1) 10Jbond: facter3/puppet5: downgrade canaries [puppet] - 10https://gerrit.wikimedia.org/r/506651 (https://phabricator.wikimedia.org/T219803) [11:57:05] (03PS1) 10Jbond: puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506652 (https://phabricator.wikimedia.org/T219803) [12:00:15] (03PS2) 10Jbond: facter3/puppet5: update interface fact parsing [puppet] - 10https://gerrit.wikimedia.org/r/506651 (https://phabricator.wikimedia.org/T219803) [12:00:29] (03PS2) 10Jbond: puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506652 (https://phabricator.wikimedia.org/T219803) [12:01:05] (03CR) 10Jbond: [C: 03+2] facter3/puppet5: downgrade canaries [puppet] - 10https://gerrit.wikimedia.org/r/506650 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [12:05:04] (03CR) 10Jbond: "https://puppet-compiler.wmflabs.org/compiler1001/16085/ (still running)" [puppet] - 10https://gerrit.wikimedia.org/r/506651 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [12:09:14] (03PS8) 10Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - 10https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) [12:09:25] (03PS1) 10Jbond: puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506657 (https://phabricator.wikimedia.org/T219803) [12:09:39] !log upgrade puppet 4=> 5 and facter 2 => 3 on canary hosts: elastic1017, ganeti2001, analytics1042 [12:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:49] (03PS8) 10Arturo Borrero Gonzalez: standard: introduce a wrapper profile [puppet] - 10https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225) 
[12:09:51] (03PS9) 10Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - 10https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) [12:10:13] (03CR) 10jerkins-bot: [V: 04-1] sudo: decouple sudo from sudo-ldap [puppet] - 10https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [12:10:25] (03CR) 10Jbond: [C: 03+2] puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506657 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [12:10:59] (03CR) 10jerkins-bot: [V: 04-1] sudo: decouple sudo from sudo-ldap [puppet] - 10https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [12:13:31] (03Abandoned) 10CDanis: swift-object-replicator: nice it [puppet] - 10https://gerrit.wikimedia.org/r/506540 (owner: 10CDanis) [12:13:58] (03CR) 10CDanis: "thanks! I'll merge this & do a rolling restart of swift-object-replicator on Monday" [puppet] - 10https://gerrit.wikimedia.org/r/506321 (owner: 10CDanis) [12:15:02] (03PS3) 10Ema: cache: hiera setting to list backend services [puppet] - 10https://gerrit.wikimedia.org/r/506636 (https://phabricator.wikimedia.org/T219967) [12:15:04] 10Operations, 10Traffic, 10monitoring: RIPE Atlas data in Prometheus - https://phabricator.wikimedia.org/T221964 (10CDanis) [12:15:20] (03CR) 10Ema: "pcc looks fine https://puppet-compiler.wmflabs.org/compiler1002/16075/" [puppet] - 10https://gerrit.wikimedia.org/r/506636 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [12:15:39] (03PS1) 10Jbond: puppet5/facter3: Revert upgrade until interfaces fact fixed [puppet] - 10https://gerrit.wikimedia.org/r/506658 (https://phabricator.wikimedia.org/T219803) [12:16:32] (03CR) 10Ema: [C: 03+2] cache: hiera setting to list backend services [puppet] - 10https://gerrit.wikimedia.org/r/506636 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [12:20:37] !log repool 
cp3030 after directors.frontend.vcl testing T219967 [12:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:42] T219967: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 [12:24:43] (03PS3) 10Jbond: facter3/puppet5: update interface fact parsing [puppet] - 10https://gerrit.wikimedia.org/r/506651 (https://phabricator.wikimedia.org/T219803) [12:24:46] (03PS3) 10Jbond: puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506652 (https://phabricator.wikimedia.org/T219803) [12:28:37] (03PS21) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [12:28:39] (03PS1) 10Vgutierrez: prometheus: Support several instances of the trafficserver exporter [puppet] - 10https://gerrit.wikimedia.org/r/506659 (https://phabricator.wikimedia.org/T221217) [12:28:41] (03PS9) 10Arturo Borrero Gonzalez: standard: introduce a wrapper profile [puppet] - 10https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225) [12:28:43] (03PS10) 10Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - 10https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) [12:29:40] (03PS4) 10Jbond: puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506652 (https://phabricator.wikimedia.org/T219803) [12:31:34] (03PS1) 10Jbond: puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506660 (https://phabricator.wikimedia.org/T219803) [12:31:52] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:32:39] !log upgrade puppet 4=> 5 and facter 2 => 3 on kafka2001.yaml kafka-jumbo1001.yaml kafka1012.yaml 
[12:32:48] (03CR) 10Jbond: [C: 03+2] puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506660 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [12:33:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "This whole-fleet PCC run is as expected: https://puppet-compiler.wmflabs.org/compiler1002/16084/" [puppet] - 10https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [12:34:58] Is gerrit being slower than usual for anyone else? I've noticed an increase in threads (but not a major increase but definitely noticeable one). [12:37:04] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:39:04] PROBLEM - Recursive DNS on 208.80.154.20 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/DNS [12:40:05] (03PS22) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [12:40:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "PCC for toolforge: https://puppet-compiler.wmflabs.org/compiler1002/16087/" [puppet] - 10https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [12:43:37] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Track remaining trusty servers in production - https://phabricator.wikimedia.org/T212772 (10Andrew) [12:44:14] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [12:44:43] !log pool cp4021 w/ ATS backend T219967 [12:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:47] T219967: Replace Varnish 
backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 [12:45:21] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=cache_upload,name=cp4021.ulsfo.wmnet,dc=ulsfo [12:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:25] (03PS23) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [12:48:12] !log upgrade puppet 4=> 5 and facter 2 => 3 on mc1019, maps1001 and logstash1007 [12:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:19] (03PS1) 10Jbond: puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506661 (https://phabricator.wikimedia.org/T219803) [12:49:41] (03PS10) 10Arturo Borrero Gonzalez: standard: introduce a wrapper profile [puppet] - 10https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225) [12:49:43] (03PS11) 10Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - 10https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) [12:50:16] (03CR) 10Jbond: [C: 03+2] puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506661 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [12:50:33] !log Restarting hhvm on mw1288 [12:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:15] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Track remaining trusty servers in production - https://phabricator.wikimedia.org/T212772 (10aborrero) [12:51:22] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:51:26] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): 
Track remaining trusty servers in production - https://phabricator.wikimedia.org/T212772 (10aborrero) [12:51:51] (03CR) 10Gehel: [C: 04-1] "See comments inline" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/506610 (owner: 10Mathew.onipe) [12:52:22] !log cp4025: restart varnish-be due to mbox lag [12:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:42] RECOVERY - Nginx local proxy to apache on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:53:02] RECOVERY - Apache HTTP on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:53:30] RECOVERY - HHVM rendering on mw1288 is OK: HTTP OK: HTTP/1.1 200 OK - 79669 bytes in 0.810 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:55:14] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:57:38] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:06:36] (03PS1) 10Faidon Liambotis: Add "accounting" report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506663 [13:06:42] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:07:10] (03CR) 10jerkins-bot: [V: 04-1] Add "accounting" report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506663 (owner: 10Faidon Liambotis) [13:08:51] (03PS2) 10Faidon Liambotis: Add "accounting" 
report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506663 [13:09:18] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:15:08] (03PS2) 10Gehel: elasticsearch: rename "update lag" check to "update rate" [puppet] - 10https://gerrit.wikimedia.org/r/503366 [13:15:33] (03Abandoned) 10Mholloway: Add cron job to update WikimediaEditorTasks suggestions table [puppet] - 10https://gerrit.wikimedia.org/r/500104 (https://phabricator.wikimedia.org/T218136) (owner: 10Mholloway) [13:15:53] (03CR) 10Gehel: [C: 03+2] elasticsearch: rename "update lag" check to "update rate" [puppet] - 10https://gerrit.wikimedia.org/r/503366 (owner: 10Gehel) [13:17:04] (03CR) 10Gehel: [C: 03+2] elasticsearch: reset all indices to read/write [cookbooks] - 10https://gerrit.wikimedia.org/r/502220 (https://phabricator.wikimedia.org/T219799) (owner: 10Gehel) [13:17:08] (03PS2) 10Gehel: elasticsearch: reset all indices to read/write [cookbooks] - 10https://gerrit.wikimedia.org/r/502220 (https://phabricator.wikimedia.org/T219799) [13:17:41] (03PS1) 10Jbond: puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506664 (https://phabricator.wikimedia.org/T219803) [13:17:54] !log upgrade puppet 4=> 5 and facter 2 => 3 on canary hosts: mw1311.yaml, mx2001 & dubnium [13:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:11] 10Operations, 10Patch-For-Review: Replacement of network::constant's special_hosts - https://phabricator.wikimedia.org/T220894 (10Krenair) [13:18:14] 10Operations, 10Puppet, 10Patch-For-Review: Migrate as much as possible from network::constants from network.pp to hiera - https://phabricator.wikimedia.org/T87519 (10Krenair) [13:18:23] (03CR) 10Jbond: [C: 03+2] puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506664 
(https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [13:18:36] (03PS2) 10Jbond: puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506664 (https://phabricator.wikimedia.org/T219803) [13:18:38] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:18:39] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506664 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [13:19:40] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:20:00] (03CR) 10Alex Monk: standard: introduce a wrapper profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [13:21:29] 10Operations, 10Patch-For-Review: Replacement of network::constant's special_hosts - https://phabricator.wikimedia.org/T220894 (10Krenair) [13:22:53] 10Operations, 10Patch-For-Review: Replacement of network::constant's special_hosts - https://phabricator.wikimedia.org/T220894 (10Krenair) [13:26:06] (03CR) 10WMDE-leszek: "AFAIK beta wikis (labs) will get the updated code within 30 minutes after https://gerrit.wikimedia.org/r/#/q/Ibd35823e12cef3f9d6236447a3d5" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505999 (https://phabricator.wikimedia.org/T209377) (owner: 10Nray) [13:26:26] (03CR) 10WMDE-leszek: [C: 03+1] Remove wikibase sameAs A/B test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505999 (https://phabricator.wikimedia.org/T209377) (owner: 10Nray) [13:26:50] (03PS2) 10Andrew Bogott: Move labservices1001/1002 to role::spare and clean up [puppet] - 
10https://gerrit.wikimedia.org/r/506566 [13:28:44] (03CR) 10Andrew Bogott: [C: 03+2] Move labservices1001/1002 to role::spare and clean up [puppet] - 10https://gerrit.wikimedia.org/r/506566 (owner: 10Andrew Bogott) [13:35:18] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:40:08] (03CR) 10CRusnov: "First pass looks good. Will take a more detailed look in a bit." (033 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506663 (owner: 10Faidon Liambotis) [13:40:44] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:40:48] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:42:12] the flapping ulsfo errors are due to 404s from swift being turned into 503s by varnish due to discrepancy between Content-Length and actual content length [13:44:29] (03CR) 10Effie Mouzeli: "See comments inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/506626 (https://phabricator.wikimedia.org/T221401) (owner: 10Gilles) [13:46:36] (03CR) 10Elukey: [C: 04-1] "In my opinion Alex's comments are really good, we shouldn't include any profile in a module (especially since the profile contains hiera l" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [13:48:20] (03CR) 10Faidon Liambotis: Add "accounting" report (033 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506663 (owner: 10Faidon Liambotis) [13:51:50] (03PS1) 10Alex Monk: Move monitoring_hosts out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/506672 (https://phabricator.wikimedia.org/T220894) [13:52:28] (03CR) 10jerkins-bot: [V: 04-1] Move monitoring_hosts out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/506672 (https://phabricator.wikimedia.org/T220894) (owner: 10Alex Monk) [13:53:56] (03PS2) 10Alex Monk: Move monitoring_hosts out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/506672 (https://phabricator.wikimedia.org/T220894) [13:56:51] (03PS1) 10Cmjohnson: Removind dns entries for decom hosts pc1004-6 [dns] - 10https://gerrit.wikimedia.org/r/506674 (https://phabricator.wikimedia.org/T210969) [14:00:06] (03PS2) 10Ema: package_builder: install lintian from backports [puppet] - 10https://gerrit.wikimedia.org/r/506598 [14:00:34] (03CR) 10Cmjohnson: [C: 03+2] Removind dns entries for decom hosts pc1004-6 [dns] - 10https://gerrit.wikimedia.org/r/506674 (https://phabricator.wikimedia.org/T210969) (owner: 10Cmjohnson) [14:01:00] (03CR) 10Ema: [C: 03+2] package_builder: 
install lintian from backports [puppet] - 10https://gerrit.wikimedia.org/r/506598 (owner: 10Ema) [14:02:44] (03PS2) 10Marostegui: Removind dns entries for decom hosts pc1004-6 [dns] - 10https://gerrit.wikimedia.org/r/506674 (https://phabricator.wikimedia.org/T210969) (owner: 10Cmjohnson) [14:04:37] ACKNOWLEDGEMENT - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. andrew bogott this is me working on the fullstack test [14:05:42] 10Operations, 10ops-eqiad, 10DBA, 10decommission, 10Patch-For-Review: Decommission parsercache hosts: pc1004.eqiad.wmnet pc1005.eqiad.wmnet pc1006.eqiad.wmnet - https://phabricator.wikimedia.org/T210969 (10Marostegui) So DNS is now clean Are the switches ports also cleaned up? T210969#4795686 [14:06:06] (03PS1) 10Ema: Revert "package_builder: install lintian from backports" [puppet] - 10https://gerrit.wikimedia.org/r/506677 [14:09:06] 10Operations, 10ops-eqiad, 10DBA, 10decommission, 10Patch-For-Review: Decommission parsercache hosts: pc1004.eqiad.wmnet pc1005.eqiad.wmnet pc1006.eqiad.wmnet - https://phabricator.wikimedia.org/T210969 (10Marostegui) [14:10:04] (03PS1) 10Andrew Bogott: Revert "Move labservices1001/1002 to role::spare and clean up" [puppet] - 10https://gerrit.wikimedia.org/r/506678 [14:10:29] (03CR) 10Jbond: [C: 03+1] standard: introduce a wrapper profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [14:11:24] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:12:40] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting 
[14:12:49] (03PS2) 10Andrew Bogott: Revert "Move labservices1001/1002 to role::spare and clean up" [puppet] - 10https://gerrit.wikimedia.org/r/506678 [14:14:33] (03CR) 10Elukey: [C: 04-1] standard: introduce a wrapper profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [14:15:48] (03PS1) 10Ema: package_builder: move lintian out of require_package [puppet] - 10https://gerrit.wikimedia.org/r/506680 [14:15:49] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Move labservices1001/1002 to role::spare and clean up" [puppet] - 10https://gerrit.wikimedia.org/r/506678 (owner: 10Andrew Bogott) [14:16:36] (03PS1) 10Andrew Bogott: Move labservices1001/1002 to role::spare and clean up [puppet] - 10https://gerrit.wikimedia.org/r/506681 [14:18:44] !log Set pc1004-1006 and pc2004-2006 as unracked on netbox - T209858 T210969 [14:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:50] T209858: Decommission parsercache hosts: pc2004 pc2005 pc2006 (Dec 2018 lease return) - https://phabricator.wikimedia.org/T209858 [14:18:51] T210969: Decommission parsercache hosts: pc1004.eqiad.wmnet pc1005.eqiad.wmnet pc1006.eqiad.wmnet - https://phabricator.wikimedia.org/T210969 [14:19:07] (03CR) 10Muehlenhoff: [C: 03+1] "This should fix it." [puppet] - 10https://gerrit.wikimedia.org/r/506680 (owner: 10Ema) [14:21:22] 10Operations, 10ops-eqiad, 10DBA, 10decommission, 10Patch-For-Review: Decommission parsercache hosts: pc1004.eqiad.wmnet pc1005.eqiad.wmnet pc1006.eqiad.wmnet - https://phabricator.wikimedia.org/T210969 (10Marostegui) [14:21:54] 10Operations, 10ops-eqiad, 10DBA, 10decommission, 10Patch-For-Review: Decommission parsercache hosts: pc1004.eqiad.wmnet pc1005.eqiad.wmnet pc1006.eqiad.wmnet - https://phabricator.wikimedia.org/T210969 (10Marostegui) I have tried to set these servers as unracked, but I have failed to do so on Netbox (I... 
[14:22:10] 10Operations, 10ops-codfw, 10DBA, 10decommission, 10Patch-For-Review: Decommission parsercache hosts: pc2004 pc2005 pc2006 (Dec 2018 lease return) - https://phabricator.wikimedia.org/T209858 (10Marostegui) I have tried to set these servers as unracked, but I have failed to do so on Netbox (I guess I don'... [14:22:22] (03PS1) 10Jbond: standard refactor: remove standard class from base classes [puppet] - 10https://gerrit.wikimedia.org/r/506682 (https://phabricator.wikimedia.org/T221225) [14:22:44] (03PS1) 10Elukey: profile::kerberos: make krb.conf working with multiple KDC servers [puppet] - 10https://gerrit.wikimedia.org/r/506683 [14:24:07] (03PS1) 10Andrew Bogott: Keystone: remove ferm rule for keystone in the 'main' region [puppet] - 10https://gerrit.wikimedia.org/r/506684 [14:24:39] (03CR) 10Jbond: [C: 03+1] standard: introduce a wrapper profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [14:24:58] (03CR) 10Jbond: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/16092/console" [puppet] - 10https://gerrit.wikimedia.org/r/506682 (https://phabricator.wikimedia.org/T221225) (owner: 10Jbond) [14:26:44] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: remove ferm rule for keystone in the 'main' region [puppet] - 10https://gerrit.wikimedia.org/r/506684 (owner: 10Andrew Bogott) [14:27:32] (03PS2) 10Ema: package_builder: move lintian out of require_package [puppet] - 10https://gerrit.wikimedia.org/r/506680 [14:28:11] (03CR) 10Ema: [C: 03+2] package_builder: move lintian out of require_package [puppet] - 10https://gerrit.wikimedia.org/r/506680 (owner: 10Ema) [14:28:31] (03PS2) 10Elukey: profile::kerberos: make krb.conf working with multiple KDC servers [puppet] - 10https://gerrit.wikimedia.org/r/506683 [14:29:42] (03CR) 10Bstorm: "There's a huge backlog of improvements that I'm avoiding in this module until I can kill off labstore1003, but 
this seems safe enough :)" [puppet] - 10https://gerrit.wikimedia.org/r/506331 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [14:30:05] (03PS2) 10Bstorm: labstore::fileserver::exports: convert to systemd service [puppet] - 10https://gerrit.wikimedia.org/r/506331 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [14:30:08] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational [14:31:16] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/16096/kerberos1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/506683 (owner: 10Elukey) [14:31:48] (03CR) 10Bstorm: [C: 03+2] labstore::fileserver::exports: convert to systemd service [puppet] - 10https://gerrit.wikimedia.org/r/506331 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [14:33:53] 10Operations, 10Traffic, 10monitoring: RIPE Atlas data in Prometheus - https://phabricator.wikimedia.org/T221964 (10faidon) [14:33:55] 10Operations, 10monitoring: Add RIPE atlas data to Prometheus - https://phabricator.wikimedia.org/T167689 (10faidon) [14:34:03] cdanis: sorry! 
:) [14:34:13] (03Abandoned) 10Ema: Revert "package_builder: install lintian from backports" [puppet] - 10https://gerrit.wikimedia.org/r/506677 (owner: 10Ema) [14:35:23] (03CR) 10Jbond: [C: 03+1] "This looks much nicer thanks" [puppet] - 10https://gerrit.wikimedia.org/r/506672 (https://phabricator.wikimedia.org/T220894) (owner: 10Alex Monk) [14:36:27] (03CR) 10Herron: [C: 03+1] ulogd: rename nflog comment [puppet] - 10https://gerrit.wikimedia.org/r/506634 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [14:36:58] (03PS2) 10Jbond: ulogd: rename nflog comment [puppet] - 10https://gerrit.wikimedia.org/r/506634 (https://phabricator.wikimedia.org/T116011) [14:38:39] (03CR) 10Jbond: [C: 03+2] ulogd: rename nflog comment [puppet] - 10https://gerrit.wikimedia.org/r/506634 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [14:41:55] (03CR) 10Herron: [C: 03+1] "Indeed seems unnecessary. Worth a try!" [puppet] - 10https://gerrit.wikimedia.org/r/506579 (owner: 10Dzahn) [14:43:23] 10Operations, 10ops-codfw, 10DBA, 10decommission, 10Patch-For-Review: Decommission parsercache hosts: pc2004 pc2005 pc2006 (Dec 2018 lease return) - https://phabricator.wikimedia.org/T209858 (10Marostegui) Nevermind that comment, I read https://wikitech.wikimedia.org/wiki/Server_Lifecycle#States wrongly.... [14:44:10] 10Operations, 10ops-eqiad, 10DBA, 10decommission, 10Patch-For-Review: Decommission parsercache hosts: pc1004.eqiad.wmnet pc1005.eqiad.wmnet pc1006.eqiad.wmnet - https://phabricator.wikimedia.org/T210969 (10Marostegui) 05Open→03Resolved I read https://wikitech.wikimedia.org/wiki/Server_Lifecycle#State... 
[14:44:11] (03PS1) 10Andrew Bogott: Temporary hack: turn off designate on labservices1001 [puppet] - 10https://gerrit.wikimedia.org/r/506686 (https://phabricator.wikimedia.org/T221183) [14:45:16] (03CR) 10Andrew Bogott: [C: 03+2] Temporary hack: turn off designate on labservices1001 [puppet] - 10https://gerrit.wikimedia.org/r/506686 (https://phabricator.wikimedia.org/T221183) (owner: 10Andrew Bogott) [14:48:02] (03CR) 10Andrew Bogott: "There's one last dependency (at least, I hope it's the last). Newly created VMs seem to depend on labs-recursor0 before getting puppetize" [puppet] - 10https://gerrit.wikimedia.org/r/506681 (owner: 10Andrew Bogott) [14:49:30] (03CR) 10Herron: [C: 03+1] "> per IRC discussion, will need at least an extra sudo privileges" [puppet] - 10https://gerrit.wikimedia.org/r/506542 (https://phabricator.wikimedia.org/T220860) (owner: 10Dzahn) [14:52:04] (03Abandoned) 10Herron: admin: add foks to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/506492 (https://phabricator.wikimedia.org/T220860) (owner: 10Herron) [14:54:04] (03PS2) 10Jbond: standard refactor: remove standard class from base classes [puppet] - 10https://gerrit.wikimedia.org/r/506682 (https://phabricator.wikimedia.org/T221225) [14:55:51] (03CR) 10Jbond: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/16097/" [puppet] - 10https://gerrit.wikimedia.org/r/506682 (https://phabricator.wikimedia.org/T221225) (owner: 10Jbond) [14:57:46] (03CR) 10Faidon Liambotis: [C: 03+2] puppetdb_microservice: Add acceptable facts [puppet] - 10https://gerrit.wikimedia.org/r/506570 (owner: 10CRusnov) [15:05:14] !log upgrade puppet 4=> 5 and facter 2 => 3 on canary hosts: ores1001.yaml wtp1025.yaml rdb1006.yaml [15:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:13] (03PS1) 10Jbond: puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506691 (https://phabricator.wikimedia.org/T219803) [15:07:03] (03CR) 
10Jbond: [C: 03+2] puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506691 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [15:16:32] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:17:32] (03PS1) 10Jbond: puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506693 (https://phabricator.wikimedia.org/T219803) [15:18:08] (03CR) 10Jbond: [C: 03+2] puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506693 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [15:18:15] 10Operations, 10Core Platform Team, 10Multi-Content-Revisions (Reactive), 10Regression, 10Wikimedia-production-error: Unable to move page (Special:MovePage&action=submit) - https://phabricator.wikimedia.org/T221763 (10CCicalese_WMF) [15:18:29] 10Operations, 10Core Platform Team (MCR), 10Multi-Content-Revisions (Reactive), 10Regression, 10Wikimedia-production-error: Unable to move page (Special:MovePage&action=submit) - https://phabricator.wikimedia.org/T221763 (10kchapman) [15:18:48] 10Operations, 10Core Platform Team Backlog, 10Core Platform Team (MCR), 10Multi-Content-Revisions (Reactive), and 2 others: Unable to move page (Special:MovePage&action=submit) - https://phabricator.wikimedia.org/T221763 (10kchapman) [15:18:51] (03PS1) 10BBlack: Fixup content-length on rewrite of thumbor 404s [puppet] - 10https://gerrit.wikimedia.org/r/506694 [15:19:04] 10Operations, 10Core Platform Team (MCR), 10Core Platform Team Backlog (Next), 10Multi-Content-Revisions (Reactive), and 2 others: Unable to move page (Special:MovePage&action=submit) - https://phabricator.wikimedia.org/T221763 (10kchapman) [15:19:13] 10Operations, 10Core Platform Team (MCR), 10Core Platform Team Backlog (Next), 10Multi-Content-Revisions (Reactive), and 2 others: Unable to move page (Special:MovePage&action=submit) - https://phabricator.wikimedia.org/T221763 
(10kchapman) a:03Anomie [15:19:13] problem: rdb1006 is me fixing now [15:19:27] 👍 [15:21:32] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [15:23:45] (03CR) 10Arturo Borrero Gonzalez: "This directly conflict with https://gerrit.wikimedia.org/r/c/operations/puppet/+/506614 we should decide which patch to merge before." [puppet] - 10https://gerrit.wikimedia.org/r/506682 (https://phabricator.wikimedia.org/T221225) (owner: 10Jbond) [15:25:32] (03PS2) 10CRusnov: puppetdb_microservice: Add acceptable facts [puppet] - 10https://gerrit.wikimedia.org/r/506570 [15:27:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/506682 (https://phabricator.wikimedia.org/T221225) (owner: 10Jbond) [15:28:18] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/506682 (https://phabricator.wikimedia.org/T221225) (owner: 10Jbond) [15:29:08] (03CR) 10Alex Monk: "this should make that patch possible within the puppet style conventions, this should probably go first." [puppet] - 10https://gerrit.wikimedia.org/r/506682 (https://phabricator.wikimedia.org/T221225) (owner: 10Jbond) [15:31:57] (03Abandoned) 10BBlack: Fixup content-length on rewrite of thumbor 404s [puppet] - 10https://gerrit.wikimedia.org/r/506694 (owner: 10BBlack) [15:32:10] 10Operations, 10fundraising-tech-ops, 10netops: Network setup for frmon2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T221475 (10cwdent) @ayounsi after talking to @jgreen I'm going to redo the DNS using the wildcard cert to also have the failover cname. Is there a public IP available for the cod... 
[15:34:01] !log upgrade puppet 4=> 5 and facter 2 => 3 on canary hosts: thumbor1001 ms-fe1005 ms-be1013 scb1001 restbase1007 [15:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:20] (03PS1) 10Jbond: puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506696 (https://phabricator.wikimedia.org/T219803) [15:35:01] (03CR) 10Jbond: [C: 03+2] puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506696 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [15:35:10] (03PS2) 10Jbond: puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506696 (https://phabricator.wikimedia.org/T219803) [15:36:23] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "This should be merged first: https://gerrit.wikimedia.org/r/c/operations/puppet/+/506682" [puppet] - 10https://gerrit.wikimedia.org/r/506614 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [15:36:29] 10Operations, 10Core Platform Team, 10DBA, 10MediaWiki-Database, and 6 others: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (10EvanProdromou) So, it looks like this ticket is done, and we're just waiting for it to go out... [15:37:10] budget owner types: when is the deadline for capex proposals? 
[15:37:14] (03PS3) 10Faidon Liambotis: Add "accounting" report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506663 [15:37:31] 10Operations, 10DBA, 10MediaWiki-Database, 10MediaWiki-Logging, and 5 others: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (10EvanProdromou) [15:37:51] (03CR) 10Jbond: "LGTM, wonder if we should clean up the lsof output in a future release but it may make troubleshooting a bit less clear" [puppet] - 10https://gerrit.wikimedia.org/r/506595 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:37:56] (03CR) 10Jbond: [C: 03+1] Add sd-pam processes to filter list for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/506595 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:38:44] 10Operations, 10DBA, 10MediaWiki-Database, 10MediaWiki-Logging, and 5 others: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (10Marostegui) I believe so [15:42:59] (03CR) 10Herron: [C: 03+1] "Looks good to me!
Pending upload of the elastalert package as discussed on hangout" [puppet] - 10https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) (owner: 10Filippo Giunchedi) [15:43:24] (03CR) 10CRusnov: Add "accounting" report (033 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506663 (owner: 10Faidon Liambotis) [15:45:23] (03PS1) 10Jcrespo: mariadb: Productionize db2097 for backup source of s1 and s6 [puppet] - 10https://gerrit.wikimedia.org/r/506697 (https://phabricator.wikimedia.org/T220572) [15:46:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/506682 (https://phabricator.wikimedia.org/T221225) (owner: 10Jbond) [15:49:01] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/506682 (https://phabricator.wikimedia.org/T221225) (owner: 10Jbond) [15:49:55] (03CR) 10Marostegui: [C: 03+1] mariadb: Productionize db2097 for backup source of s1 and s6 [puppet] - 10https://gerrit.wikimedia.org/r/506697 (https://phabricator.wikimedia.org/T220572) (owner: 10Jcrespo) [15:56:49] (03PS7) 10Urbanecm: Prepare initial configuration for initiativeswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504502 (https://phabricator.wikimedia.org/T167375) [15:57:22] (03PS2) 10Jcrespo: mariadb: Productionize db2097 for backup source of s1 and s6 [puppet] - 10https://gerrit.wikimedia.org/r/506697 (https://phabricator.wikimedia.org/T220572) [15:58:33] (03CR) 10Urbanecm: Prepare initial configuration for initiativeswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504502 (https://phabricator.wikimedia.org/T167375) (owner: 10Urbanecm) [15:59:33] (03CR) 10Urbanecm: [C: 04-1] "See T221525#5140718" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505765 (https://phabricator.wikimedia.org/T221525) (owner: 10Tulsi Bhagat) [16:08:34] Thanks Urbanecm! [16:08:42] I will fix it soon. 
[16:08:45] yw Tulsi, happy to help [16:08:56] :) [16:10:45] (03CR) 10Jcrespo: [C: 03+2] mariadb: Productionize db2097 for backup source of s1 and s6 [puppet] - 10https://gerrit.wikimedia.org/r/506697 (https://phabricator.wikimedia.org/T220572) (owner: 10Jcrespo) [16:18:52] !log stop s6 mariadb instance on dbstore2001 T220572 [16:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:56] T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host - https://phabricator.wikimedia.org/T220572 [16:20:44] 10Operations, 10fundraising-tech-ops, 10netops: Network setup for frmon2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T221475 (10ayounsi) https://github.com/wikimedia/operations-dns/blob/master/templates/152.80.208.in-addr.arpa#L24 `208.80.152.235/28` is free. [16:25:23] PROBLEM - Recursive DNS on 208.80.155.118 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/DNS [16:30:06] I assume that means the icinga plugin didn't work right heh [16:30:24] it's labs-recursor0's IP, so I donno [16:30:28] (03PS2) 10Andrew Bogott: Move labservices1001/1002 to role::spare and clean up [puppet] - 10https://gerrit.wikimedia.org/r/506681 [16:30:56] I do seem to get answers from it from the outside world at least [16:31:14] can't reach it from inside, maybe intentional? 
[16:31:58] also the icinga wikitech link there is misleading, that's for production-y dns stuff and this is the labs-recursor which is completely different I assume [16:32:04] (03CR) 10Andrew Bogott: [C: 03+2] Move labservices1001/1002 to role::spare and clean up [puppet] - 10https://gerrit.wikimedia.org/r/506681 (owner: 10Andrew Bogott) [16:33:45] bblack, labs-recursor0 is being turned off [16:33:49] andrewbogott, ^ [16:34:20] bblack: I downtimed all services on that host but it managed to alert anyway [16:38:33] I guess that alert isn't attached to the host… anyway, I downtimed it as well [17:03:12] (03PS1) 10Cdentinger: Add failover URL and public IP for frmon* [dns] - 10https://gerrit.wikimedia.org/r/506707 (https://phabricator.wikimedia.org/T221475) [17:13:00] (03CR) 10Dzahn: "woohoo. thanks for merging" [puppet] - 10https://gerrit.wikimedia.org/r/506331 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [17:15:39] (03PS3) 10Dzahn: raid: add Icinga notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/506548 (https://phabricator.wikimedia.org/T197873) [17:16:29] (03CR) 10Ayounsi: [C: 04-1] "PTR missing." 
[dns] - 10https://gerrit.wikimedia.org/r/506707 (https://phabricator.wikimedia.org/T221475) (owner: 10Cdentinger) [17:17:30] 10Operations, 10decommission, 10Patch-For-Review: Decommission labservices1001, 1002 - https://phabricator.wikimedia.org/T221857 (10Andrew) [17:21:35] (03CR) 10Dzahn: [C: 03+2] raid: add Icinga notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/506548 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [17:25:10] (03PS2) 10Cdentinger: Add failover URL and public IP for frmon* [dns] - 10https://gerrit.wikimedia.org/r/506707 (https://phabricator.wikimedia.org/T221475) [17:26:19] (03CR) 10Dzahn: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/506542 (https://phabricator.wikimedia.org/T220860) (owner: 10Dzahn) [17:26:23] (03CR) 10Dzahn: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/506542 (https://phabricator.wikimedia.org/T220860) (owner: 10Dzahn) [17:27:51] (03PS1) 10Cdentinger: don't forget ptr records [dns] - 10https://gerrit.wikimedia.org/r/506712 [17:28:45] @seen foks [17:28:45] mutante: Last time I saw foks they were quitting the network with reason: Quit: later! 
N/A at 4/26/2019 6:13:14 AM (11h15m31s ago) [17:28:50] 10Operations, 10Traffic, 10Core Platform Team Backlog (Next), 10Services (next): Have Varnish set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976 (10mobrovac) [17:28:59] 10Operations, 10Traffic, 10Core Platform Team Backlog (Next), 10Services (next): Have Varnish set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976 (10mobrovac) [17:29:10] 10Operations, 10Traffic, 10Core Platform Team Backlog (Designing), 10MW-1.33-notes (1.33.0-wmf.17; 2019-02-12), and 6 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10mobrovac) [17:29:13] mutante: hello [17:29:22] I’m commuting at the moment [17:30:02] tzatziki: oh:) what i wanted to ask is .. what is command you had to run .. mwscript changePassword.php ? [17:30:07] (03Abandoned) 10Cdentinger: don't forget ptr records [dns] - 10https://gerrit.wikimedia.org/r/506712 (owner: 10Cdentinger) [17:30:26] tzatziki: of course it has time until later [17:33:36] 10Operations, 10Traffic, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Package libvmod-uuid for Debian - https://phabricator.wikimedia.org/T221977 (10mobrovac) p:05Triage→03Normal [17:33:49] (03PS3) 10Cdentinger: Add failover URL and public IP for frmon* [dns] - 10https://gerrit.wikimedia.org/r/506707 (https://phabricator.wikimedia.org/T221475) [17:33:56] 10Operations, 10Traffic, 10Core Platform Team Backlog (Next), 10Services (next): Have Varnish set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976 (10mobrovac) [17:34:00] 10Operations, 10Traffic, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Package libvmod-uuid for Debian - https://phabricator.wikimedia.org/T221977 (10mobrovac) [17:34:02] mutante: yeah I believe so. 
[17:34:20] I can change emails thru mwmaint [17:36:40] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [17:37:46] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time https://phabricator.wikimedia.org/T174916 [17:38:09] (03CR) 10Dzahn: "now just one more PTR for the one in wmnet" [dns] - 10https://gerrit.wikimedia.org/r/506707 (https://phabricator.wikimedia.org/T221475) (owner: 10Cdentinger) [17:39:41] tzatziki: emails but not passwords? do you use sudo? ideally i would like to see a copy/paste of the command / output. working on that access request.. [17:39:51] 10Operations, 10decommission, 10Patch-For-Review: Decommission labservices1001, 1002 - https://phabricator.wikimedia.org/T221857 (10Andrew) a:05Andrew→03RobH [17:40:38] mutante: I believe only emails, there was some authentication issue for passwords I thought? I don’t use sudo. [17:40:44] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:40:55] I thought you used mwscript tzatziki [17:41:10] I do [17:41:16] that uses sudo [17:41:17] (03CR) 10Andrew Bogott: [C: 04-1] Allow puppet-merge to merge the labs/private repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506582 (https://phabricator.wikimedia.org/T221888) (owner: 10Andrew Bogott) [17:41:22] I don’t really know what it’s doing in the backend heh [17:42:55] (the nginx thing on ulsfo is already recovering) [17:43:38] tzatziki: when you got the time later, maybe can you copy/paste an example somewhere [17:43:52] Ok [17:44:24] ideally also with the way it fails for passwords vs email.. thank you! [17:44:34] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:46:49] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 16 ge 4 daniel_zahn https://phabricator.wikimedia.org/T215411 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [17:49:43] (03PS1) 10Bstorm: cloudstore: fail over ip address via hiera for scratch/maps cloudstore [puppet] - 10https://gerrit.wikimedia.org/r/506714 (https://phabricator.wikimedia.org/T209527) [17:50:05] RECOVERY - Check systemd state on analytics1052 is OK: OK - running: The system is fully operational [17:50:08] !log analytics1052 - reported broken systemd state in Icinga - service mcelog was in state failed - systemctl start mcelog - (T212219 ?) [17:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:12] T212219: wmf-auto-restart fails on certain legacy services - https://phabricator.wikimedia.org/T212219 [17:50:41] (03CR) 10jerkins-bot: [V: 04-1] cloudstore: fail over ip address via hiera for scratch/maps cloudstore [puppet] - 10https://gerrit.wikimedia.org/r/506714 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [17:50:43] (03PS1) 10Andrew Bogott: nova: pool labvirt1001, 1002, 1003, 1004 [puppet] - 10https://gerrit.wikimedia.org/r/506715 (https://phabricator.wikimedia.org/T221141) [17:51:56] (03CR) 10Ayounsi: [C: 03+1] "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/506707 (https://phabricator.wikimedia.org/T221475) (owner: 10Cdentinger) [17:51:57] andrewbogott you mean cloudvirt? 
[17:52:23] yes :) [17:52:32] (03PS2) 10Andrew Bogott: nova: pool cloudvirt1001, 1002, 1003, 1004 [puppet] - 10https://gerrit.wikimedia.org/r/506715 (https://phabricator.wikimedia.org/T221141) [17:55:38] (03PS2) 10Bstorm: cloudstore: fail over ip address via hiera for scratch/maps cloudstore [puppet] - 10https://gerrit.wikimedia.org/r/506714 (https://phabricator.wikimedia.org/T209527) [18:06:22] (03CR) 10CRusnov: Add "accounting" report (031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506663 (owner: 10Faidon Liambotis) [18:06:25] (03CR) 10Bstorm: [C: 03+2] cloudstore: fail over ip address via hiera for scratch/maps cloudstore [puppet] - 10https://gerrit.wikimedia.org/r/506714 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [18:07:55] (03PS1) 10CRusnov: Cleanups to the oldhardware report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/506718 (https://phabricator.wikimedia.org/T220422) [18:10:46] (03CR) 10Bstorm: [C: 03+2] cloudstore: fail over ip address via hiera for scratch/maps cloudstore [puppet] - 10https://gerrit.wikimedia.org/r/506714 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [18:11:06] 10Operations, 10Operations-Software-Development, 10netops, 10User-crusnov: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (10crusnov) a:03crusnov [18:13:01] (03PS1) 10Dzahn: kafka: add icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/506719 [18:13:20] need to find a place to charge laptop..about to shutdown.. 
bbiaw :p [18:14:24] (03PS1) 10Andrew Bogott: designate/pdns: added a comment about how to bootstrap the pdns database [puppet] - 10https://gerrit.wikimedia.org/r/506720 (https://phabricator.wikimedia.org/T221106) [18:15:37] (03CR) 10Andrew Bogott: [C: 03+2] designate/pdns: added a comment about how to bootstrap the pdns database [puppet] - 10https://gerrit.wikimedia.org/r/506720 (https://phabricator.wikimedia.org/T221106) (owner: 10Andrew Bogott) [18:17:54] (03PS1) 10Bstorm: cloudstore: fix the interface name and add a comment [puppet] - 10https://gerrit.wikimedia.org/r/506721 (https://phabricator.wikimedia.org/T209527) [18:21:15] (03CR) 10Bstorm: [C: 03+2] cloudstore: fix the interface name and add a comment [puppet] - 10https://gerrit.wikimedia.org/r/506721 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [18:21:17] 10Operations, 10ops-codfw: scs-c1-codfw : update serial in netbox - https://phabricator.wikimedia.org/T221984 (10RobH) p:05Triage→03Low [18:21:24] (03PS2) 10Bstorm: cloudstore: fix the interface name and add a comment [puppet] - 10https://gerrit.wikimedia.org/r/506721 (https://phabricator.wikimedia.org/T209527) [18:25:19] So I just had a puppet-merge error that I'm not sure how to fix. [18:25:27] https://www.irccloud.com/pastebin/ADvZNnoq/ [18:25:44] andrewbogott: since it's our puppetmasters you might have a solution off the top of your head... [18:25:54] I was running in screen because of unstable internet [18:26:01] 10Operations, 10Discovery-Wikidata-Query-Service-Sprint: data reimport on wdqs1009 and wdqs1010 - https://phabricator.wikimedia.org/T220830 (10Smalyshev) 05Open→03Resolved p:05Triage→03Normal [18:27:08] bstorm_: definitely never seen that! I don't know why it would care about the terminal [18:27:26] Huh. Dunno. That's just on ours [18:28:13] I'm just concerned that means those masters are missing things? [18:28:45] Hm... [18:28:49] * andrewbogott looks for something to merge [18:28:57] that is odd.
puppet-merge uses the 'tput' utility just to generate color codes: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/puppetmaster/templates/puppet-merge.erb$9 [18:29:06] Yeah, it didn't manage to get my change [18:29:23] I just did a merge a minute ago and it worked [18:29:24] you can always just export TERM=screen and try again I guess [18:29:36] It tries to colorize the OK/FAIL messages, I wonder if it's as simple as that somehow? [18:29:41] yeah that's what it is andrewbogott [18:29:47] 10Operations, 10ops-codfw: scs-c1-codfw : update serial in netbox - https://phabricator.wikimedia.org/T221984 (10RobH) Also its the newest firmware so not sure why its not reporting the serial number, but its a trivial issue. [18:29:55] cdanis: it only does things if there are pending patches, so hard to retest without having things to actually merge [18:30:14] cdanis: do you know what steps I need to take in order to push out the change if that fails? [18:30:15] It failed because I was in screen I think [18:30:56] bstorm_: When I face this problem I just write some code comments and merge 'em :) [18:31:01] Since it never hurts to write comments [18:31:07] I don't see my latest commit in the log [18:31:08] the calls to tput are the first things in the script -- so failures due to that should mean nothing was merged and it should exit immediately, I think [18:31:13] On labspuppetmaster1001 [18:31:16] try TERM=screen puppet-merge [18:32:56] So the changes are not present on some masters [18:32:58] cdanis: why would it have worked on other puppetmasters but not labpuppetmasters? [18:33:01] Different OS versions? 
[18:33:26] (03PS4) 10Cdentinger: Add failover URL and public IP for frmon* [dns] - 10https://gerrit.wikimedia.org/r/506707 (https://phabricator.wikimedia.org/T221475) [18:33:46] andrewbogott: different OS versions; the terminfo/termcap files that describe per-terminal control codes probably aren't there for "screen.xterm-256color" for whatever reason [18:33:47] (03CR) 10jerkins-bot: [V: 04-1] Add failover URL and public IP for frmon* [dns] - 10https://gerrit.wikimedia.org/r/506707 (https://phabricator.wikimedia.org/T221475) (owner: 10Cdentinger) [18:33:57] cdanis: I can do that next time. for now I'm not sure how to recover my changes? [18:34:00] andrewbogott: very arguably a bug in the script that it fails if it can't get those control codes from tput [18:34:07] They are present on prod [18:34:18] But not on the labspuppetmaster hosts [18:34:51] bstorm_: the next time anyone does a working merge your changes will reappear [18:35:00] hrm [18:35:01] bstorm_: AIUI you should be able to go to the labspuppetmaster hosts and run puppet-merge there by hand? [18:35:03] since puppet-merge isn't patchwise, it rebases to a given sha1 [18:35:08] or that :) [18:35:22] If it's that simple, then I'm good. [18:35:55] I was worried that it won't appear. It doesn't matter because it's a change that only affects cloudstore hosts...just don't want to leave things broken [18:36:23] so I'm away from home and without access to prod, but you also should be able to do something like...
let me see if I can get this incantation from memory [18:38:17] sudo cumin 'A:puppetmaster' 'cd /var/lib/git/operations/puppet; git rev-parse HEAD' [18:38:28] and make sure that outputs the same SHA1 across all hosts [18:39:49] I can also try changing a comment in puppet and merging it [18:40:05] that's also fine :) [18:42:43] (03PS1) 10Bstorm: cloudstore: make comments more verbose [puppet] - 10https://gerrit.wikimedia.org/r/506725 [18:43:38] (03CR) 10Bstorm: [C: 03+2] cloudstore: make comments more verbose [puppet] - 10https://gerrit.wikimedia.org/r/506725 (owner: 10Bstorm) [18:45:21] All better [18:47:37] (03PS2) 10Dzahn: kafka: add icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/506719 (https://phabricator.wikimedia.org/T197873) [18:48:25] 10Operations: puppet-merge shouldn't fail if `tput` doesn't grok your terminal - https://phabricator.wikimedia.org/T221985 (10CDanis) [18:57:47] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:03:37] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:04:02] !log changing password for Subinsebastien [19:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:29] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:08:25] (03CR) 10Gilles: [C: 03+1] profile::analytics::refinery::repository: use the 'analitics-deploy' user [puppet] - 10https://gerrit.wikimedia.org/r/506609 (https://phabricator.wikimedia.org/T220971) (owner: 10Elukey) [19:17:08] (03PS2) 10Gilles: Fix coal syslog 
logging [puppet] - 10https://gerrit.wikimedia.org/r/506626 (https://phabricator.wikimedia.org/T221401) [19:17:24] (03CR) 10Gilles: Fix coal syslog logging (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/506626 (https://phabricator.wikimedia.org/T221401) (owner: 10Gilles) [19:17:34] (03PS3) 10Gilles: Fix coal syslog logging [puppet] - 10https://gerrit.wikimedia.org/r/506626 (https://phabricator.wikimedia.org/T221401) [19:20:49] PROBLEM - Check systemd state on ms-be2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:21:43] !log varnish-backend-restart on cp4026, evidence of artificial 503s from mbox lag behavior, probably related to the semi-abuse client doing odd 404 traffic to ulsfo that's triggering bugs in swift's rewrite.py .... [19:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:43] that rather dense log entry will make sense to a few people that were discussing it at length in #wikimedia-traffic earlier today anyways, sorry if it's rather obtuse from everyone else's pov :) [19:24:01] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:27:29] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:27:51] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:29:35] PROBLEM - Check systemd state on ms-be2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[19:30:38] ty bblack [19:31:02] (03PS1) 10Bstorm: cloudstore: test failover of cloudstore1008 to cloudstore1009 [puppet] - 10https://gerrit.wikimedia.org/r/506738 (https://phabricator.wikimedia.org/T209527) [19:31:21] looks much better than anytime in the last couple hours [19:38:41] RECOVERY - Check systemd state on ms-be2014 is OK: OK - running: The system is fully operational [19:38:56] !log changing password for JDiPierro@global [19:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:01] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:43:25] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:47:39] (03PS2) 10Dzahn: base::firewall: add runbooks for check_ferm and check_conntrack [puppet] - 10https://gerrit.wikimedia.org/r/506550 (https://phabricator.wikimedia.org/T197873) [19:48:29] RECOVERY - Docker registry HTTPS interface on registry1002 is OK: HTTP OK: HTTP/1.1 200 OK - 2545 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Docker [19:48:48] (03CR) 10jerkins-bot: [V: 04-1] base::firewall: add runbooks for check_ferm and check_conntrack [puppet] - 10https://gerrit.wikimedia.org/r/506550 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [19:50:13] (03PS3) 10Dzahn: base::firewall: add runbooks for check_ferm and check_conntrack [puppet] - 10https://gerrit.wikimedia.org/r/506550 (https://phabricator.wikimedia.org/T197873) [19:51:17] (03CR) 10jerkins-bot: [V: 04-1] base::firewall: add runbooks for check_ferm and check_conntrack [puppet] - 10https://gerrit.wikimedia.org/r/506550 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [19:52:38] (03PS4) 
Dzahn: base::firewall: add runbooks for check_ferm and check_conntrack [puppet] - https://gerrit.wikimedia.org/r/506550 (https://phabricator.wikimedia.org/T197873)
[19:54:00] (CR) jerkins-bot: [V: -1] base::firewall: add runbooks for check_ferm and check_conntrack [puppet] - https://gerrit.wikimedia.org/r/506550 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn)
[19:55:47] RECOVERY - Check systemd state on ms-be2013 is OK: OK - running: The system is fully operational
[19:55:57] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[19:56:19] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:59:51] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[20:00:13] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[20:00:42] Operations, media-storage, Patch-For-Review, User-fgiunchedi: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 (CDanis)
[20:06:03] (CR) Dzahn: [C: +1] "lgtm, matches the requirements from the ticket for having the fundraising landing page etc" [mediawiki-config] - https://gerrit.wikimedia.org/r/504502 (https://phabricator.wikimedia.org/T167375) (owner: Urbanecm)
[20:08:59] PROBLEM - Check systemd state on ms-be1014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:10:17] RECOVERY - Check systemd state on ms-be1014 is OK: OK - running: The system is fully operational
[20:11:16] Operations, ops-codfw: scs-c1-codfw : update serial in netbox - https://phabricator.wikimedia.org/T221984 (Papaul) a:Papaul→RobH @faidon asked me this question before migrating from Racktables to Netbox. I can not access the serial number of the device. I will have to pull out both cable manage...
[20:12:10] (PS3) Dzahn: ldap-admins: add foks, add admin group on labweb hosts [puppet] - https://gerrit.wikimedia.org/r/506542 (https://phabricator.wikimedia.org/T220860)
[20:12:57] (CR) jerkins-bot: [V: -1] ldap-admins: add foks, add admin group on labweb hosts [puppet] - https://gerrit.wikimedia.org/r/506542 (https://phabricator.wikimedia.org/T220860) (owner: Dzahn)
[20:15:19] !log changing email and password for "Lemon martini@global"
[20:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:35] (PS4) Dzahn: ldap-admins: add foks, add admin group on labweb hosts [puppet] - https://gerrit.wikimedia.org/r/506542 (https://phabricator.wikimedia.org/T220860)
[20:20:49] Operations, ops-codfw: scs-c1-codfw : update serial in netbox - https://phabricator.wikimedia.org/T221984 (RobH) p:Low→Lowest >>! In T221984#5141414, @Papaul wrote: > @faidon asked me this question before migrating from Racktables to Netbox. I can not access the serial number of the device. I w...
[20:22:56] (PS1) Dzahn: admins: remove ability to run commands as user 'apache' [puppet] - https://gerrit.wikimedia.org/r/506750
[20:27:05] (PS1) Bstorm: cloudstore: add ping check for ip conflict [puppet] - https://gerrit.wikimedia.org/r/506751 (https://phabricator.wikimedia.org/T209527)
[20:33:26] Operations, Mail: Move most (all?)
exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144 (Dzahn)
[20:33:28] Operations, Mail: remove RT mail aliases - https://phabricator.wikimedia.org/T220844 (Dzahn) Open→Resolved I tested mailing these and it didn't work anyways. Also the queues are not in RT anymore. Just access-requests, procurement and maint-announce. Removed.
[20:34:31] (CR) Andrew Bogott: [C: +1] "This is not a thing I've seen done but it seems like it should work. Let's see :)" [puppet] - https://gerrit.wikimedia.org/r/506751 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[20:38:31] (CR) Bstorm: [C: +2] cloudstore: add ping check for ip conflict [puppet] - https://gerrit.wikimedia.org/r/506751 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[20:38:47] (CR) Dzahn: "why Invalid commit message now?" [puppet] - https://gerrit.wikimedia.org/r/506550 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn)
[20:42:24] (CR) Bstorm: [C: +2] cloudstore: test failover of cloudstore1008 to cloudstore1009 [puppet] - https://gerrit.wikimedia.org/r/506738 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[20:42:35] (PS2) Bstorm: cloudstore: test failover of cloudstore1008 to cloudstore1009 [puppet] - https://gerrit.wikimedia.org/r/506738 (https://phabricator.wikimedia.org/T209527)
[20:42:44] (CR) Bstorm: [V: +2 C: +2] cloudstore: test failover of cloudstore1008 to cloudstore1009 [puppet] - https://gerrit.wikimedia.org/r/506738 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[20:45:21] (PS1) Bstorm: Revert "cloudstore: test failover of cloudstore1008 to cloudstore1009" [puppet] - https://gerrit.wikimedia.org/r/506836
[20:49:47] (CR) Bstorm: [C: +2] Revert "cloudstore: test failover of cloudstore1008 to cloudstore1009" [puppet] - https://gerrit.wikimedia.org/r/506836 (owner: Bstorm)
[20:57:25] RECOVERY - HP RAID on ms-be2032 is OK: OK: Slot 3: OK: 2I:4:1,
2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[21:11:50] (PS5) Cdentinger: Add failover URL and public IP for frmon* [dns] - https://gerrit.wikimedia.org/r/506707 (https://phabricator.wikimedia.org/T221475)
[21:17:24] Operations, Traffic, Wikidata, Wikidata-Query-Service, and 2 others: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (Smalyshev)
[21:20:26] (Abandoned) Paladox: Gerrit: Introduce support for SiteNotice in PolyGerrit [puppet] - https://gerrit.wikimedia.org/r/488192 (https://phabricator.wikimedia.org/T215323) (owner: Paladox)
[21:24:35] (PS3) Jbond: standard refactor: remove standard class from base classes [puppet] - https://gerrit.wikimedia.org/r/506682 (https://phabricator.wikimedia.org/T221225)
[21:24:41] (PS5) Paladox: Gerrit: Update soy templates for gerrit 2.16 [puppet] - https://gerrit.wikimedia.org/r/473264
[21:24:43] (PS6) Paladox: Gerrit: Update soy templates for gerrit 2.16 [puppet] - https://gerrit.wikimedia.org/r/473264
[21:39:19] (PS5) CDanis: base::firewall: add runbooks for check_ferm and check_conntrack [puppet] - https://gerrit.wikimedia.org/r/506550 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn)
[22:04:26] (CR) Ayounsi: [C: +1] Add failover URL and public IP for frmon* [dns] - https://gerrit.wikimedia.org/r/506707 (https://phabricator.wikimedia.org/T221475) (owner: Cdentinger)
[22:10:41] PROBLEM - Apache HTTP on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[22:11:49] RECOVERY - Apache HTTP on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.044 second response time
https://wikitech.wikimedia.org/wiki/Application_servers
[22:40:49] Operations, Puppet: puppet-merge shouldn't fail if `tput` doesn't grok your terminal - https://phabricator.wikimedia.org/T221985 (herron) p:Triage→Normal
[22:42:50] Operations, cloud-services-team: Investigate use of hp-asrd on HPE servers - https://phabricator.wikimedia.org/T221939 (herron) p:Triage→Normal
[22:44:05] Operations, media-storage, monitoring: swift backend decomms / rebalances are noisy - https://phabricator.wikimedia.org/T221904 (herron) p:Triage→Normal
[22:44:40] Operations, Puppet, puppet-compiler: Frequent puppet failures - https://phabricator.wikimedia.org/T221529 (herron) p:Triage→Normal
[22:57:10] Operations, netops: cr4-ulsfo rebooted unexpectedly - https://phabricator.wikimedia.org/T221156 (ayounsi) > After checking the information. > We have created a PR 1433009 and engineering will analyze and will update us on the findings.
[23:31:53] Operations, Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144 (Dzahn) I have sent email to 26 different people, former board members, former staff etc, asking them if they still use their aliases and are aware of them and at the same time checking if the reci...
[23:59:19] Operations, serviceops: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (Dzahn)