[00:00:47] RECOVERY - puppet last run on cp1060 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [00:17:52] 06Operations, 10Traffic, 07HTTPS: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#1101271 (10Platonides) >>! In T92002#1879207, @BBlack wrote: > The issue here is that for PKP to assert validity, it's not enough that we're signed by a CA that's on our list... [00:20:30] (03PS5) 10Andrew Bogott: Keystonehooks: Sync ldap project groups with keystone project membership [puppet] - 10https://gerrit.wikimedia.org/r/338918 [00:21:39] (03CR) 10jerkins-bot: [V: 04-1] Keystonehooks: Sync ldap project groups with keystone project membership [puppet] - 10https://gerrit.wikimedia.org/r/338918 (owner: 10Andrew Bogott) [00:23:30] (03PS6) 10Andrew Bogott: Keystonehooks: Sync ldap project groups with keystone project membership [puppet] - 10https://gerrit.wikimedia.org/r/338918 [00:28:37] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:33:53] (03PS7) 10Andrew Bogott: Keystonehooks: Sync ldap project groups with keystone project membership [puppet] - 10https://gerrit.wikimedia.org/r/338918 (https://phabricator.wikimedia.org/T150091) [00:35:57] (03CR) 10Andrew Bogott: [C: 032] Keystonehooks: Sync ldap project groups with keystone project membership [puppet] - 10https://gerrit.wikimedia.org/r/338918 (https://phabricator.wikimedia.org/T150091) (owner: 10Andrew Bogott) [00:39:58] 06Operations, 10hardware-requests, 13Patch-For-Review: Replace bast3001 - https://phabricator.wikimedia.org/T156506#3054564 (10Dzahn) I was able to install jessie on the-server-formerly-known-as-hooft as "bast3002". It did not work over http and over tftp it was still very slow but it did work. ``` Debian... 
[00:49:57] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 1802.310592 Seconds [00:50:57] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 31.588395 Seconds [00:53:35] 06Operations, 10Annual-Report, 10Security-Reviews, 13Patch-For-Review: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#3054590 (10Dzahn) >>! In T151798#3053857, @dpatrick wrote: > I've reviewed both content and technical implementation Thank you! Way more detailed than exp... [00:56:27] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479 [00:56:32] (03PS1) 10Krinkle: [WIP] mediawiki: Add cache-warmup to maintenance [puppet] - 10https://gerrit.wikimedia.org/r/339802 (https://phabricator.wikimedia.org/T156922) [00:56:37] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [00:57:27] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4008488 keys, up 116 days 16 hours - replication_delay is 0 [00:58:11] (03PS2) 10Krinkle: [WIP] mediawiki: Add cache-warmup to maintenance [puppet] - 10https://gerrit.wikimedia.org/r/339802 (https://phabricator.wikimedia.org/T156922) [01:01:41] (03PS1) 10Dzahn: annualreport: add X-Frame-Options header to Apache config [puppet] - 10https://gerrit.wikimedia.org/r/339803 (https://phabricator.wikimedia.org/T151798) [01:01:55] (03CR) 10jerkins-bot: [V: 04-1] annualreport: add X-Frame-Options header to Apache config [puppet] - 10https://gerrit.wikimedia.org/r/339803 (https://phabricator.wikimedia.org/T151798) (owner: 10Dzahn) [01:03:23] (03PS2) 10Dzahn: annualreport: add X-Frame-Options header to Apache config [puppet] - 10https://gerrit.wikimedia.org/r/339803 (https://phabricator.wikimedia.org/T151798) [01:04:33] (03CR) 10Dzahn: 
"https://geekflare.com/secure-apache-from-clickjacking-with-x-frame-options/" [puppet] - 10https://gerrit.wikimedia.org/r/339803 (https://phabricator.wikimedia.org/T151798) (owner: 10Dzahn) [01:05:08] (03CR) 10Dzahn: [C: 032] annualreport: add X-Frame-Options header to Apache config [puppet] - 10https://gerrit.wikimedia.org/r/339803 (https://phabricator.wikimedia.org/T151798) (owner: 10Dzahn) [01:11:35] 06Operations, 10Annual-Report, 10Security-Reviews, 13Patch-For-Review: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#3054621 (10Dzahn) >>! In T151798#3053857, @dpatrick wrote: > * [[ https://www.owasp.org/index.php/OWASP_Secure_Headers_Project#X-Frame-Options | X-Frame-Op... [01:35:07] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 1801.311485 Seconds [01:36:07] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 42.38049 Seconds [01:39:30] 06Operations, 10Annual-Report, 10Security-Reviews, 13Patch-For-Review: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#3054637 (10Dzahn) a:05ZMcCune>03Dzahn [01:43:26] !log bast3002 - sign puppet cert, initial run with basic "bastion" role, to replace broken bast3001, but WIP, ganglia/prometheus roles not moved yet (T156506) [01:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:43:32] T156506: Replace bast3001 - https://phabricator.wikimedia.org/T156506 [01:48:43] 06Operations, 10Annual-Report, 10Security-Reviews, 13Patch-For-Review: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#3054655 (10Dzahn) Old index redirect is cached but that's known and ok that way until Monday. 
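The Postgres replication-lag alerts above flip from CRITICAL (~1800 seconds) back to OK (tens of seconds) within a minute. A minimal sketch of the threshold comparison such a check might apply — the 1800-second critical and 300-second warning thresholds here are assumptions read off the alert text, not the actual check configuration:

```python
def rep_lag_status(delay_seconds, warn=300.0, crit=1800.0):
    """Map a replication delay in seconds to a Nagios-style state line.

    The warn/crit thresholds are illustrative guesses, not the real config.
    """
    if delay_seconds >= crit:
        return "CRITICAL - Rep Delay is: %f Seconds" % delay_seconds
    if delay_seconds >= warn:
        return "WARNING - Rep Delay is: %f Seconds" % delay_seconds
    return "OK - Rep Delay is: %f Seconds" % delay_seconds

print(rep_lag_status(1802.310592))  # the maps-test2002 spike above
print(rep_lag_status(31.588395))    # the recovery one minute later
```

The quick flap suggests a single delayed WAL apply catching up, which is why the recovery lands a minute after the problem.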
[01:56:04] 06Operations, 10Ops-Access-Requests, 06Discovery, 06Maps: Give Max Semenik deployment rights for Maps - https://phabricator.wikimedia.org/T158820#3048354 (10Dzahn) @Muehlenhoff ^ fyi,re: sudo permissions not appearing in the data.yaml [01:57:21] 06Operations, 10hardware-requests, 13Patch-For-Review: Replace bast3001 - https://phabricator.wikimedia.org/T156506#3054667 (10Dzahn) ``` [bast3002:~] $ gen_fingerprints +---------+---------+-------------------------------------------------+ | Cipher | Algo | Fingerprint... [02:00:49] 06Operations, 10hardware-requests, 13Patch-For-Review: Replace bast3001 - https://phabricator.wikimedia.org/T156506#3054668 (10Dzahn) Next is https://gerrit.wikimedia.org/r/#/c/339684/ and moving the roles: installserver::tftp prometheus::ops ganglia::monitor::aggregator from 3001 to 3002, then shutt... [02:19:49] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.13) (duration: 07m 20s) [02:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:20:01] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 1800.474118 Seconds [02:21:01] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 21.463512 Seconds [02:25:11] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Feb 25 02:25:10 UTC 2017 (duration 5m 21s) [02:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:37:31] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 22 probes of 269 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [02:42:31] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 16 probes of 269 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [02:53:01] PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:22:01] RECOVERY - puppet last run on mw1240 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [03:23:51] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 617.10 seconds [03:26:51] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 205.11 seconds [04:05:56] (03CR) 10Tim Landscheidt: "On Precise Labs instances, this gives:" [puppet] - 10https://gerrit.wikimedia.org/r/339231 (owner: 10Faidon Liambotis) [05:00:09] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:41:09] PROBLEM - puppet last run on labstore1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:07:09] PROBLEM - puppet last run on radium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:09:09] RECOVERY - puppet last run on labstore1005 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:31:29] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:36:09] RECOVERY - puppet last run on radium is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:59:29] RECOVERY - puppet last run on osmium is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [07:29:49] PROBLEM - puppet last run on elastic1039 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:57:49] RECOVERY - puppet last run on elastic1039 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [08:23:29] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.129 second response time [08:50:29] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.144 second response time [09:05:43] (03Abandoned) 10Giuseppe Lavagetto: Allow defining the conftool entities via a schema file [software/conftool] - 10https://gerrit.wikimedia.org/r/278892 (owner: 10Giuseppe Lavagetto) [09:06:20] (03Abandoned) 10Giuseppe Lavagetto: realm: convert main_ipaddress and site into facts [puppet] - 10https://gerrit.wikimedia.org/r/311223 (https://phabricator.wikimedia.org/T85459) (owner: 10Giuseppe Lavagetto) [10:19:59] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 1809.867048 Seconds [10:20:59] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 28.270515 Seconds [10:36:59] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:04:59] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [11:10:39] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:23:29] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.166 second response time [11:34:29] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 376.26 seconds [11:38:39] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [11:50:29] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.185 second response time [12:09:09] PROBLEM - puppet last run on mw1237 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:19:09] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:24:29] RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 34.21 seconds [12:37:09] RECOVERY - puppet last run on mw1237 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [12:43:04] Reedy: ping [12:48:09] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [12:49:28] Krinkle: around? [12:57:29] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:01:19] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:01:19] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
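The citoid CHECK_NRPE socket timeouts above boil down to a TCP connect with a deadline. A minimal, self-contained sketch of that probe logic — the message wording and error handling are illustrative, not NRPE's actual implementation; the 10-second default matches the "Socket timeout after 10 seconds" text in the alerts:

```python
import socket

def check_tcp(host, port, timeout=10.0):
    """Probe a TCP endpoint NRPE-style: OK on connect, CRITICAL otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "OK"
    except socket.timeout:
        return "CRITICAL: Socket timeout after %d seconds." % int(timeout)
    except OSError:
        return "CRITICAL: connection failed"

# Demonstrate against a port that is almost certainly closed:
# bind to an ephemeral port, note its number, release it, then probe it.
s = socket.socket()
s.bind(("127.0.0.1", 0))
free_port = s.getsockname()[1]
s.close()
print(check_tcp("127.0.0.1", free_port, timeout=2.0))
```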
[13:01:39] PROBLEM - zotero on sca1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:02:00] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:09] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:09] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [13:02:12] maybe somebody could *dry-run* namespaceDupes.php for ext.wikipedia to check a thing? [13:02:19] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [13:02:21] tnx [13:02:29] RECOVERY - zotero on sca1003 is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.006 second response time [13:02:59] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [13:02:59] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [13:27:29] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [14:15:05] 06Operations, 10Phabricator, 06Release-Engineering-Team: Update file phab_epipe.py to use maniphest.edit conduit api - https://phabricator.wikimedia.org/T159043#3055204 (10Paladox) [14:21:51] 06Operations, 10Phabricator: Update phabricator.py file to use maniphest.edit conduit api - https://phabricator.wikimedia.org/T159045#3055228 (10Paladox) [14:26:00] PROBLEM - puppet last run on prometheus2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:33:09] PROBLEM - puppet last run on mw1192 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:46:19] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:47:59] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
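Process checks like the thumbor1001 dhclient one above report counts such as "PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion". A toy sketch of that counting logic over a snapshot of command lines — the process list below is fabricated for illustration:

```python
import re

def count_procs(cmdlines, pattern):
    """Count command lines whose arguments match a regex, check_procs-style."""
    rx = re.compile(pattern)
    return sum(1 for cmd in cmdlines if rx.search(cmd))

# Fabricated snapshot of a host's process command lines.
procs = [
    "/usr/bin/python /usr/bin/salt-minion",
    "/usr/sbin/sshd -D",
]

n = count_procs(procs, r"^/usr/bin/python /usr/bin/salt-minion")
print("PROCS OK: %d process with regex args "
      "^/usr/bin/python /usr/bin/salt-minion" % n)
```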
[14:48:19] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:48:49] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [14:49:09] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:54:00] RECOVERY - puppet last run on prometheus2001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [15:01:09] RECOVERY - puppet last run on mw1192 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:01:59] PROBLEM - puppet last run on elastic1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:14:19] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:16:09] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:59] RECOVERY - puppet last run on elastic1028 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [15:38:29] PROBLEM - puppet last run on rdb1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:09] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [16:06:29] RECOVERY - puppet last run on rdb1008 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [16:06:49] PROBLEM - puppet last run on db1078 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:17:09] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 24 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [16:22:09] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 13 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [16:30:08] (03PS3) 10Tim Landscheidt: Tools: Outfactor jobkill script to toollabs::node::all [puppet] - 10https://gerrit.wikimedia.org/r/335755 [16:32:43] (03PS4) 10Tim Landscheidt: Redirect wiki.toolserver.org to www.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: 10Nemo bis) [16:35:49] RECOVERY - puppet last run on db1078 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [16:36:08] 06Operations: reset admin password for Wikimania-l - https://phabricator.wikimedia.org/T159048#3055305 (10Paladox) [16:37:01] (03PS2) 10Tim Landscheidt: puppet: Remove templatedir setting [puppet] - 10https://gerrit.wikimedia.org/r/338540 (https://phabricator.wikimedia.org/T95158) [16:47:49] PROBLEM - puppet last run on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:48:05] (03CR) 10Paladox: [C: 031] puppet: Remove templatedir setting [puppet] - 10https://gerrit.wikimedia.org/r/338540 (https://phabricator.wikimedia.org/T95158) (owner: 10Tim Landscheidt) [16:50:49] PROBLEM - Check systemd state on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:50:49] PROBLEM - Check size of conntrack table on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:51:09] PROBLEM - SSH on bast3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:51:19] PROBLEM - configured eth on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:51:19] PROBLEM - Check whether ferm is active by checking the default input chain on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:52:59] RECOVERY - SSH on bast3001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [16:53:20] RECOVERY - configured eth on bast3001 is OK: OK - interfaces up [16:53:20] RECOVERY - Check whether ferm is active by checking the default input chain on bast3001 is OK: OK ferm input default policy is set [16:53:49] PROBLEM - dhclient process on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:55:39] RECOVERY - dhclient process on bast3001 is OK: PROCS OK: 0 processes with command name dhclient [16:55:39] RECOVERY - Check systemd state on bast3001 is OK: OK - running: The system is fully operational [16:55:39] RECOVERY - Check size of conntrack table on bast3001 is OK: OK: nf_conntrack is 0 % full [16:55:39] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 39 minutes ago with 0 failures [16:56:19] PROBLEM - puppet last run on elastic1020 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:24:19] RECOVERY - puppet last run on elastic1020 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:41:39] (03PS1) 10Volans: Fix additional minor issues reported by codacy [software/cumin] - 10https://gerrit.wikimedia.org/r/339833 (https://phabricator.wikimedia.org/T158967) [17:41:41] (03PS1) 10Volans: Make docstring pep257 compliant [software/cumin] - 10https://gerrit.wikimedia.org/r/339834 (https://phabricator.wikimedia.org/T158967) [17:48:02] (03PS2) 10Volans: Make docstring pep257 compliant [software/cumin] - 10https://gerrit.wikimedia.org/r/339834 (https://phabricator.wikimedia.org/T158967) [17:51:21] (03PS2) 10Volans: Fix additional minor issues reported by codacy [software/cumin] - 10https://gerrit.wikimedia.org/r/339833 (https://phabricator.wikimedia.org/T158967) [17:52:57] (03CR) 10Volans: [C: 032] Fix additional minor issues reported by codacy [software/cumin] - 10https://gerrit.wikimedia.org/r/339833 (https://phabricator.wikimedia.org/T158967) (owner: 10Volans) [17:54:08] (03Merged) 10jenkins-bot: Fix additional minor issues reported by codacy [software/cumin] - 10https://gerrit.wikimedia.org/r/339833 (https://phabricator.wikimedia.org/T158967) (owner: 10Volans) [18:02:01] 06Operations, 10Phabricator, 07Technical-Debt: Update phabricator.py file to use maniphest.edit conduit api - https://phabricator.wikimedia.org/T159045#3055365 (10Aklapper) [18:02:08] 06Operations, 10Phabricator, 06Release-Engineering-Team, 07Technical-Debt: Update file phab_epipe.py to use maniphest.edit conduit api - https://phabricator.wikimedia.org/T159043#3055367 (10Aklapper) [18:02:15] (03PS3) 10Volans: Make docstring pep257 compliant [software/cumin] - 10https://gerrit.wikimedia.org/r/339834 (https://phabricator.wikimedia.org/T158967) [18:03:13] 06Operations, 10Phabricator, 07Technical-Debt: Update wmf_auto_reimage.py file to use maniphest.edit conduit api - 
https://phabricator.wikimedia.org/T159045#3055371 (10Paladox) [18:06:22] (03PS4) 10Volans: Make docstring pep257 compliant [software/cumin] - 10https://gerrit.wikimedia.org/r/339834 (https://phabricator.wikimedia.org/T158967) [18:08:42] (03CR) 10Volans: [C: 032] Make docstring pep257 compliant [software/cumin] - 10https://gerrit.wikimedia.org/r/339834 (https://phabricator.wikimedia.org/T158967) (owner: 10Volans) [18:09:18] (03Merged) 10jenkins-bot: Make docstring pep257 compliant [software/cumin] - 10https://gerrit.wikimedia.org/r/339834 (https://phabricator.wikimedia.org/T158967) (owner: 10Volans) [18:11:36] 06Operations, 06Operations-Software-Development, 10Phabricator, 07Technical-Debt: Update wmf_auto_reimage.py file to use maniphest.edit conduit api - https://phabricator.wikimedia.org/T159045#3055379 (10Volans) a:03Volans [18:17:57] 06Operations, 06Operations-Software-Development, 10Phabricator, 07Technical-Debt: Update wmf_auto_reimage.py file to use maniphest.edit conduit api - https://phabricator.wikimedia.org/T159045#3055383 (10Volans) p:05Triage>03Normal [18:19:32] paladox: seems that createtask is also frozen and will be deprecated [18:19:43] Yep [18:19:58] then I need another task for raid_handler.py... let me open it so I don't forget [18:20:09] Ok [18:20:23] volans you could just edit the one I created to say replace all deprecated code [18:20:47] for those 2 yeah, makes sense, given that they're both 'mine' :) [18:21:00] Yep :) [18:21:17] volans I've been trying to figure out how to replace it in its-phabricator (gerrit) [18:21:35] I came up with a solution like [18:21:43] https://gerrit-review.googlesource.com/#/c/98576/ [18:22:09] but it doesn't build locally, it fails, so I need to fix that too :) [18:22:09] PROBLEM - puppet last run on mc1008 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:23:00] lol, ok I will take a look when doing it, I guess it's not that urgent given that it's now frozen, then it will be deprecated and then removed [18:25:50] 06Operations, 06Operations-Software-Development, 10Phabricator, 07Technical-Debt: Update Puppet repo code that uses maniphest.edit conduit api - https://phabricator.wikimedia.org/T159045#3055400 (10Volans) [18:27:16] 06Operations, 06Operations-Software-Development, 10Phabricator, 07Technical-Debt: Update Puppet repo code that uses maniphest.update and maniphest.createtask conduit api - https://phabricator.wikimedia.org/T159045#3055404 (10Paladox) [18:27:43] thanks for the news [18:29:50] You're welcome :) [18:43:59] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:49:09] RECOVERY - puppet last run on mc1008 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [18:58:27] 06Operations, 10Revision-Scoring-As-A-Service-Backlog: Set up oresrdb redis node in codfw - https://phabricator.wikimedia.org/T139372#3055435 (10Halfak) I'd thought that maybe we could partition requests to limit our need for replication. E.g. even rev_ids go to eqiad and odd rev_ids go to codfw. That way, w... [19:10:59] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [19:17:29] PROBLEM - puppet last run on mw1298 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:32:09] PROBLEM - Host cp2017 is DOWN: PING CRITICAL - Packet loss = 100% [19:36:59] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:36:59] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:37:09] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:37:09] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:37:09] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:37:19] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [19:37:39] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:37:39] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [19:37:39] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [19:37:39] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [19:37:49] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [19:37:49] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [19:37:49] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [19:37:49] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [19:37:59] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:37:59] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:37:59] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: 
cp2017_v4, cp2017_v6 [19:37:59] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:37:59] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:37:59] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:37:59] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:38:00] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:38:00] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:38:01] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:38:01] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:38:07] any ops ^^? [19:38:09] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:38:09] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [19:38:09] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [19:38:19] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [19:39:06] (03PS5) 10Tim Landscheidt: Tools: Outfactor the configuration for outgoing HBA connections [puppet] - 10https://gerrit.wikimedia.org/r/267832 [19:39:19] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100% [19:39:39] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms [19:47:29] RECOVERY - puppet last run on mw1298 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [19:54:05] !log powercycled cp2017, mgmt console stuck [19:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:39] RECOVERY - Host cp2017 
is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms [19:55:39] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 56 ESP OK [19:55:39] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK [19:55:40] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK [19:55:49] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK [19:55:49] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [19:55:49] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK [19:55:49] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [19:55:59] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 54 ESP OK [19:55:59] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 54 ESP OK [19:55:59] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 54 ESP OK [19:55:59] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 54 ESP OK [19:55:59] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 54 ESP OK [19:55:59] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 54 ESP OK [19:55:59] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 54 ESP OK [19:56:00] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 54 ESP OK [19:56:00] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 54 ESP OK [19:56:01] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 54 ESP OK [19:56:01] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 54 ESP OK [19:56:02] RECOVERY - IPsec on cp4015 is OK: Strongswan OK - 54 ESP OK [19:56:19] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK [19:56:19] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 56 ESP OK [19:56:39] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 54 ESP OK [19:57:32] the host seems up and running, but I'm not sure what's best (pooled=no for investigation, or leave it running) [19:59:26] https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=cp2017 - host frozen, not seeing anything weird in metrics before the stop [20:01:58] varnishlog shows 200s [20:03:49] I'd be tempted to execute the depool command (that basically 
calls confctl) [20:03:58] ema: any chance you are there? [20:04:02] (or bblack) [20:06:46] !log depooled cp2017 (via local sudo -i depool command) since the host froze (it got back after a powercycle) [20:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:29] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed [20:08:39] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:08:49] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused [20:09:09] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:09:31] <_joe_> some network issue? [20:10:34] <_joe_> ok no [20:10:39] 06Operations, 10Traffic: cp2017 froze and stopped serving traffic - https://phabricator.wikimedia.org/T159056#3055512 (10elukey) [20:10:47] <_joe_> it's just that etcdmirror failed because of my experiments [20:11:03] :) [20:11:07] Ciao _joe_ [20:11:30] <_joe_> something/someone tried to depool a server including the non-active services [20:11:35] <_joe_> oh IT WAS YOU [20:11:47] * _joe_ blames elukey [20:11:53] I ran depool! :P [20:12:39] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:12:41] <_joe_> :P [20:12:45] <_joe_> let me ack that [20:14:15] ACKNOWLEDGEMENT - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Giuseppe Lavagetto Etcd replica is broken because of my experiments with conftool. 
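Per the discussion above, the local `depool` wrapper "basically calls confctl". A hypothetical sketch of the invocation it might build — the `name=...` selector and `set/pooled=no` action mirror documented confctl usage, but the wrapper's real internals and the optional service filter are assumptions:

```python
def depool_command(host, service=None):
    """Build a confctl invocation that sets pooled=no for a host.

    The selector/action syntax follows documented confctl usage; the
    optional service filter is an illustrative assumption.
    """
    selector = "name=%s" % host
    if service:
        selector += ",service=%s" % service
    return ["confctl", "select", selector, "set/pooled=no"]

print(" ".join(depool_command("cp2017.codfw.wmnet")))
```

Depooling this way takes the frozen-then-recovered cache host out of rotation for investigation without shutting it down.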
[20:14:15] ACKNOWLEDGEMENT - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused Giuseppe Lavagetto Etcd replica is broken because of my experiments with conftool. [20:14:15] ACKNOWLEDGEMENT - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed Giuseppe Lavagetto Etcd replica is broken because of my experiments with conftool. [20:14:32] <_joe_> ok, cool [20:37:10] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [20:41:39] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [21:04:18] (03PS7) 10Tim Landscheidt: Tools: Fix argument quoting in jlocal [puppet] - 10https://gerrit.wikimedia.org/r/266935 [21:07:17] (03PS4) 10Tim Landscheidt: postgresql: Only set user password if different [puppet] - 10https://gerrit.wikimedia.org/r/329328 [21:53:29] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.199 second response time [22:13:29] PROBLEM - puppet last run on mc1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:20:29] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.185 second response time [22:38:29] PROBLEM - puppet last run on db1085 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:41:29] RECOVERY - puppet last run on mc1009 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [23:06:29] RECOVERY - puppet last run on db1085 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [23:38:28] (03CR) 10Tim Landscheidt: "(Did test it some time ago, works fine.)" [puppet] - 10https://gerrit.wikimedia.org/r/326892 (owner: 10Tim Landscheidt) [23:39:20] (03PS7) 10Tim Landscheidt: Tools: Make tools-clush-generator project-agnostic [puppet] - 10https://gerrit.wikimedia.org/r/326892 [23:39:22] (03PS5) 10Tim Landscheidt: Tools: Generate node sets dynamically [puppet] - 10https://gerrit.wikimedia.org/r/328030