[00:01:28] (PS1) Dzahn: remove endowment.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/381904 (https://phabricator.wikimedia.org/T136735)
[00:02:36] (CR) Dzahn: [C: 2] remove endowment.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/381904 (https://phabricator.wikimedia.org/T136735) (owner: Dzahn)
[00:04:20] (CR) Chad: [C: 1] add releases-jenkins to misc-web cluster [dns] - https://gerrit.wikimedia.org/r/381903 (https://phabricator.wikimedia.org/T164030) (owner: Dzahn)
[00:04:34] (PS2) Dzahn: add releases-jenkins to misc-web cluster [dns] - https://gerrit.wikimedia.org/r/381903 (https://phabricator.wikimedia.org/T164030)
[00:22:01] (PS1) Dzahn: add releases-jenkins apache/varnish, move jenkins proxy config [puppet] - https://gerrit.wikimedia.org/r/381907 (https://phabricator.wikimedia.org/T164030)
[00:29:18] bd808: the docker image has ENV LANG='en_US.UTF-8' LANGUAGE='en_US:en' LC_ALL='en_US.UTF-8'
[00:29:55] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[acme-setup-acme-gerrit]
[00:31:06] ACKNOWLEDGEMENT - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[acme-setup-acme-gerrit] daniel_zahn https://phabricator.wikimedia.org/T176532
[00:39:40] legoktm: huh. that seems like it should work then, but maybe I'm thinking about the error incorrectly.
[00:43:03] Operations, Edit-Review-Improvements, Collaboration-Feature-Rollouts (Collaboration-WL-Graduated-Everywhere), Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017), Performance: Systematically test load speeds of Watchlist and Recent Changes - https://phabricator.wikimedia.org/T176445#3652741 (...
[00:46:28] oh... \xe2 is â in latin1. That's a sure sign of reading utf-8 as latin1.
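The latin1 observation above can be reproduced with a short sketch. It is written for Python 3 (the snippet quoted in the channel uses Python 2 string syntax), and the names `ELLIPSIS` and `repair` are illustrative, not taken from the code under discussion.

```python
# Minimal sketch: what happens when UTF-8 bytes are read as Latin-1.
ELLIPSIS = "\u2026"  # the character '…'

# UTF-8 encodes '…' as three bytes, starting with the lead byte 0xE2.
utf8_bytes = ELLIPSIS.encode("utf-8")

# Decoding those bytes as Latin-1 maps each byte to one character,
# so the lead byte 0xE2 shows up as 'â'.
mojibake = utf8_bytes.decode("latin-1")

# Decoding with the right codec gives the original character back.
roundtrip = utf8_bytes.decode("utf-8")


def repair(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up (assumes no bytes were lost)."""
    return text.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    print(utf8_bytes)
    print(repr(mojibake))
    print(repair(mojibake) == ELLIPSIS)
```

A stray 'â' in front of otherwise readable text is therefore a reasonable hint that a UTF-8 stream was decoded as Latin-1 somewhere along the way, though it does not say where.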
[00:52:56] u'…'.encode('utf-8') == '\xe2\x80\xa6'
[00:58:21] (CR) Chad: [C: 1] "So basically we'll still install it on both, but Varnish will only talk to the active one? That makes sense to me--means the non-active on" [puppet] - https://gerrit.wikimedia.org/r/381907 (https://phabricator.wikimedia.org/T164030) (owner: Dzahn)
[01:00:17] (CR) Dzahn: "yea, first i thought we'd have an "if" in puppet code but when actually doing it i changed my mind because.. basically what you said" [puppet] - https://gerrit.wikimedia.org/r/381907 (https://phabricator.wikimedia.org/T164030) (owner: Dzahn)
[01:05:04] legoktm, paravoid: looks like that bug is fixed in the repo but a new version hasn't been cut yet -- https://gerrit.wikimedia.org/r/#/c/369472/
[01:07:07] mutante: Basically swap would become 1) swap varnish, 2) swap which machine is running the jobs
[01:07:14] (bonus points if I can script the latter!)
[01:07:50] Probably could do something fun with $::fqdn and such
[01:37:52] Operations, Edit-Review-Improvements, Collaboration-Feature-Rollouts (Collaboration-RC-Graduated-Everywhere), Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017), Performance: Systematically test load speeds of Watchlist and Recent Changes - https://phabricator.wikimedia.org/T176445#3652756 (...
[01:38:39] Operations, Edit-Review-Improvements, Collaboration-Feature-Rollouts (Collaboration-WL-Graduated-Everywhere), Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017), Performance: Systematically test load speeds of Watchlist and Recent Changes - https://phabricator.wikimedia.org/T176445#3625823 (...
[01:39:22] (CR) Bmansurov: "Thanks, Mforns, for the review. I'll ping Tilman with your questions. And yes 10% is right." [puppet] - https://gerrit.wikimedia.org/r/379829 (https://phabricator.wikimedia.org/T175395) (owner: Bmansurov)
[01:53:36] PROBLEM - HP RAID on db1092 is CRITICAL: CRITICAL: Slot 1: Failed: 1I:1:8 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 2I:2:1, 2I:2:2 - Controller: OK - Battery/Capacitor: OK
[01:53:37] ACKNOWLEDGEMENT - HP RAID on db1092 is CRITICAL: CRITICAL: Slot 1: Failed: 1I:1:8 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 2I:2:1, 2I:2:2 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T177264
[01:53:41] Operations, ops-eqiad: Degraded RAID on db1092 - https://phabricator.wikimedia.org/T177264#3652772 (ops-monitoring-bot)
[02:23:51] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.1) (duration: 07m 22s)
[02:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:30] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Oct 3 02:30:30 UTC 2017 (duration 6m 39s)
[02:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:42:36] (PS2) Dzahn: restbase-ng: Collect type=Table (not ColumnFamily) [puppet] - https://gerrit.wikimedia.org/r/381884 (https://phabricator.wikimedia.org/T169936) (owner: Eevans)
[02:43:19] (CR) Dzahn: [C: 2] restbase-ng: Collect type=Table (not ColumnFamily) [puppet] - https://gerrit.wikimedia.org/r/381884 (https://phabricator.wikimedia.org/T169936) (owner: Eevans)
[03:19:25] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100%
[03:20:55] RECOVERY - Host cp3048 is UP: PING OK - Packet loss = 0%, RTA = 83.79 ms
[03:23:25] !log demon@tin Pruned MediaWiki: 1.30.0-wmf.19 [keeping static files] (duration: 01m 30s)
[03:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:52:25] PROBLEM - HHVM rendering on mw2122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:53:16] RECOVERY - HHVM rendering on mw2122 is OK: HTTP OK: HTTP/1.1 200 OK - 75593 bytes in 0.554 second response time
[05:13:13] Operations, MediaWiki-Platform-Team, TechCom-RfC, HHVM, NewPHP: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#3652895 (Joe)
[05:18:16] (CR) Giuseppe Lavagetto: [C: 2] Rakefile: add wmf styleguide checks [puppet] - https://gerrit.wikimedia.org/r/381764 (owner: Giuseppe Lavagetto)
[05:18:24] (PS2) Giuseppe Lavagetto: Rakefile: add wmf styleguide checks [puppet] - https://gerrit.wikimedia.org/r/381764
[05:21:23] (CR) Giuseppe Lavagetto: "recheck" [puppet] - https://gerrit.wikimedia.org/r/381647 (owner: Hashar)
[05:22:02] (CR) jerkins-bot: [V: -1] Convert zuul::server to a profile [puppet] - https://gerrit.wikimedia.org/r/381647 (owner: Hashar)
[05:29:16] <_joe_> meh
[05:43:43] (PS1) Marostegui: Revert "db-codfw.php: Depool db2049 and db2041" [mediawiki-config] - https://gerrit.wikimedia.org/r/381926
[05:43:47] (PS2) Marostegui: Revert "db-codfw.php: Depool db2049 and db2041" [mediawiki-config] - https://gerrit.wikimedia.org/r/381926
[05:47:05] Operations, ops-codfw, DBA: db2044 HW RAID failure - https://phabricator.wikimedia.org/T174764#3652923 (Marostegui) Open>Resolved Thank you I have upgraded the server and as everything looked good, I have started MySQL again. Also cleaned HW logs so we can start fresh. MySQL is catching up n...
[05:47:42] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2049 and db2041" [mediawiki-config] - https://gerrit.wikimedia.org/r/381926 (owner: Marostegui)
[05:50:41] (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2049 and db2041" [mediawiki-config] - https://gerrit.wikimedia.org/r/381926 (owner: Marostegui)
[05:50:52] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2049 and db2041" [mediawiki-config] - https://gerrit.wikimedia.org/r/381926 (owner: Marostegui)
[05:51:52] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2049 and db2041 - T174509 (duration: 00m 47s)
[05:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:58] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[05:54:29] !log Optimize templatelinks and pagelinks tables on s5 master codfw (db2023): this will generate lag - T174509
[05:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:45] is phabricator down for anyone else?
[05:57:08] works for me legoktm
[05:57:23] aaand started working again
[05:57:29] thanks :)
[06:09:56] (PS1) Giuseppe Lavagetto: Rakefile: fix wmf_style_delta check for detached head situations [puppet] - https://gerrit.wikimedia.org/r/381928
[06:10:46] (CR) jerkins-bot: [V: -1] Rakefile: fix wmf_style_delta check for detached head situations [puppet] - https://gerrit.wikimedia.org/r/381928 (owner: Giuseppe Lavagetto)
[06:12:08] (PS2) Giuseppe Lavagetto: Rakefile: fix wmf_style_delta check for detached head situations [puppet] - https://gerrit.wikimedia.org/r/381928
[06:20:43] (PS1) Marostegui: db-eqiad.php: Depool db1066 [mediawiki-config] - https://gerrit.wikimedia.org/r/381929 (https://phabricator.wikimedia.org/T172679)
[06:20:45] (PS3) Giuseppe Lavagetto: Rakefile: fix wmf_style_delta check for detached head situations [puppet] - https://gerrit.wikimedia.org/r/381928
[06:26:18] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1066 [mediawiki-config] - https://gerrit.wikimedia.org/r/381929 (https://phabricator.wikimedia.org/T172679) (owner: Marostegui)
[06:27:52] (CR) Giuseppe Lavagetto: [C: 2] Rakefile: fix wmf_style_delta check for detached head situations [puppet] - https://gerrit.wikimedia.org/r/381928 (owner: Giuseppe Lavagetto)
[06:27:54] (CR) Giuseppe Lavagetto: "recheck" [puppet] - https://gerrit.wikimedia.org/r/381646 (owner: Hashar)
[06:28:02] (Merged) jenkins-bot: db-eqiad.php: Depool db1066 [mediawiki-config] - https://gerrit.wikimedia.org/r/381929 (https://phabricator.wikimedia.org/T172679) (owner: Marostegui)
[06:28:04] (CR) jenkins-bot: db-eqiad.php: Depool db1066 [mediawiki-config] - https://gerrit.wikimedia.org/r/381929 (https://phabricator.wikimedia.org/T172679) (owner: Marostegui)
[06:28:11] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1066 - T172679 (duration: 00m 47s)
[06:28:15] PROBLEM - graphite.wikimedia.org on graphite1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.002 second response time
[06:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:17] T172679: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679
[06:29:15] RECOVERY - graphite.wikimedia.org on graphite1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1661 bytes in 0.064 second response time
[06:30:15] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[06:30:55] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[06:31:01] Operations, ops-eqiad, DBA: Degraded RAID on db1092 - https://phabricator.wikimedia.org/T177264#3652981 (Marostegui)
[06:31:15] Operations, ops-eqiad, DBA: Degraded RAID on db1092 - https://phabricator.wikimedia.org/T177264#3652772 (Marostegui) @Cmjohnson feel free to change this disk whenever you can Thanks!
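The "HTTP 5xx reqs/min" alerts report a share of datapoints above a threshold (22.22%, 30.00%, and later 11.11%). A minimal sketch of that arithmetic follows, assuming the check counts non-null datapoints strictly above the threshold and alerts once the share reaches 1%. The function names, the comparison semantics, and the sample series are assumptions made for illustration and are not taken from the real check.

```python
# Hypothetical model of a "percent of data above threshold" alert.
def percent_above(datapoints, threshold):
    """Return the share (in percent) of non-null datapoints above threshold."""
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    over = sum(1 for v in values if v > threshold)
    return 100.0 * over / len(values)


def classify(datapoints, warning=250.0, critical=1000.0, max_percent=1.0):
    """Map a series to OK/WARNING/CRITICAL (assumed semantics)."""
    if percent_above(datapoints, critical) >= max_percent:
        return "CRITICAL"
    if percent_above(datapoints, warning) >= max_percent:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # One spike among nine one-minute samples: 1/9 of the data is over 1000.
    series = [12, 9, 15, 1800, 20, 11, 8, 14, 10]
    print(round(percent_above(series, 1000.0), 2), classify(series))
```

Under these assumptions a single one-minute spike in a nine-sample window is enough to produce an 11.11% reading, which fits the short-lived alerts that recover within minutes.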
[06:31:52] Operations, ops-eqiad, DBA: Degraded RAID on db1092 - https://phabricator.wikimedia.org/T177264#3652772 (Marostegui) p:Triage>Normal
[06:44:05] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:45:25] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:46:53] !log Drop now redundant indexes from pagelinks and templatelinks on s7 - T174509
[06:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:58] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[07:01:15] !log Optimize templatelinks and pagelinks tables on s6 codfw master db2028 (this might generate lag) - T174509
[07:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:20] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[07:03:54] !log Optimize pagelinks and templatelinks tables on labsdb1010 for s5 - T174509
[07:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:05] (PS1) Giuseppe Lavagetto: gitignore: add standard directory for bundle creation [puppet] - https://gerrit.wikimedia.org/r/381934
[07:05:07] (PS1) Giuseppe Lavagetto: utils: remove `linter` script [puppet] - https://gerrit.wikimedia.org/r/381935
[07:05:09] (PS1) Giuseppe Lavagetto: utils/git-setup: add installation of the post-commit hook [puppet] - https://gerrit.wikimedia.org/r/381936
[07:05:34] !log restart varnish backend on cp3031 (503s)
[07:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:43] !log killing stuck tileshell and cleaning up /srv/osm_expire on maps-test2001 - T175123
[07:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:49] T175123: tileshell does not honor redis configuration in /etc/tilerator/config.yaml - https://phabricator.wikimedia.org/T175123
[07:52:28] !log mobrovac@tin Started deploy [restbase/deploy@65f519b]: Create new buckets for feed end points
[07:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:04] Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for ores_classification cleanup deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171003T0800).
[08:00:04] No GERRIT patches in the queue for this window AFAICS.
[08:02:24] !log mobrovac@tin Finished deploy [restbase/deploy@65f519b]: Create new buckets for feed end points (duration: 09m 57s)
[08:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:34] !log Deploy alter table on s3 codfw master (db2018) to add PK to the main tables - T163912
[08:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:39] T163912: Convert unique keys into primary keys for some wiki tables on s3-eqiad and s3-codfw - https://phabricator.wikimedia.org/T163912
[08:07:29] (PS1) Elukey: role::mariadb::analytics: add logrotate fo eventlogging_sync's log [puppet] - https://gerrit.wikimedia.org/r/381942 (https://phabricator.wikimedia.org/T168303)
[08:07:46] I'm around now
[08:08:05] sorry for being late
[08:10:55] !log cleaning up ores_classification tables (T159753)
[08:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:00] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[08:11:17] (CR) Elukey: "pcc: https://puppet-compiler.wmflabs.org/compiler02/8149/" [puppet] - https://gerrit.wikimedia.org/r/381942 (https://phabricator.wikimedia.org/T168303) (owner: Elukey)
[08:12:31] (CR) Jcrespo: [C: 1] role::mariadb::analytics: add logrotate fo eventlogging_sync's log [puppet] - https://gerrit.wikimedia.org/r/381942 (https://phabricator.wikimedia.org/T168303) (owner: Elukey)
[08:19:46] (CR) Elukey: [C: 2] role::mariadb::analytics: add logrotate fo eventlogging_sync's log [puppet] - https://gerrit.wikimedia.org/r/381942 (https://phabricator.wikimedia.org/T168303) (owner: Elukey)
[08:21:26] PROBLEM - Apache HTTP on mw2215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:21:35] _joe_ shall I merge your patch ?
[08:22:18] <_joe_> elukey: which one?
[08:22:25] RECOVERY - Apache HTTP on mw2215 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.113 second response time
[08:22:30] <_joe_> oh yeah, I didn't merge it as it's just for CI
[08:22:31] <_joe_> :P
[08:22:33] <_joe_> sorry
[08:22:34] <_joe_> go on
[08:23:12] ack!
[08:30:53] (PS4) Filippo Giunchedi: hieradata: enable syslog over tls for ulsfo [puppet] - https://gerrit.wikimedia.org/r/381768 (https://phabricator.wikimedia.org/T136312)
[08:30:56] (PS1) Filippo Giunchedi: base: check remote_syslog and remote_syslog_tls for emptyness [puppet] - https://gerrit.wikimedia.org/r/381944 (https://phabricator.wikimedia.org/T136312)
[08:43:58] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/8150/" [puppet] - https://gerrit.wikimedia.org/r/381944 (https://phabricator.wikimedia.org/T136312) (owner: Filippo Giunchedi)
[08:46:17] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[08:48:02] this one seems a small spike not belonging to a specific varnish backend
[08:48:26] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0
[08:48:26] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[08:49:31] <_joe_> maybe related? ^^
[08:49:58] I don't see planned maintenance in the calendare
[08:50:01] *calendar
[08:51:00] (CR) Volans: [C: -1] "A bunch of comments inline, most of them from flake8" (15 comments) [puppet] - https://gerrit.wikimedia.org/r/378039 (https://phabricator.wikimedia.org/T86552) (owner: Filippo Giunchedi)
[08:51:16] elukey: don't laugh!
[08:51:21] :-P
[08:51:57] hahahahah
[08:53:21] (CR) Volans: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/381944 (https://phabricator.wikimedia.org/T136312) (owner: Filippo Giunchedi)
[08:53:55] (CR) Volans: [C: 1] "LGTM, to be merged after Ic10c81b7e5ab46f86c509073abdab433040ff87c" [puppet] - https://gerrit.wikimedia.org/r/381768 (https://phabricator.wikimedia.org/T136312) (owner: Filippo Giunchedi)
[08:54:32] so the interface down is the connection between cr1-ulsfo and eqord
[08:54:42] but the majority of 503s were esams ones
[08:56:25] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[09:00:45] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0
[09:00:45] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[09:01:26] so the ports down seems managed by Telia (very ignorant reading from librenms), probably unscheduled maintenance or similar
[09:01:37] O
[09:01:44] I'd say not related to the 503s
[09:08:48] (PS4) Alexandros Kosiaris: contint: move slave and website roles to profiles [puppet] - https://gerrit.wikimedia.org/r/381632 (owner: Hashar)
[09:08:53] (CR) Alexandros Kosiaris: [V: 2 C: 2] contint: move slave and website roles to profiles [puppet] - https://gerrit.wikimedia.org/r/381632 (owner: Hashar)
[09:14:04] <_joe_> akosiaris: so I've seen your comments on https://gerrit.wikimedia.org/r/#/c/379729
[09:14:18] <_joe_> I'm inclined to fix most of those in subsequent patchsets
[09:14:24] <_joe_> if that's ok with you
[09:14:43] <_joe_> that patch is already too large for my taste
[09:15:02] <_joe_> I mean we could go the other way around, split that patch in smaller bits
[09:15:14] <_joe_> probably that's the best way to go
[09:15:41] hehehe
[09:15:50] <_joe_> what do you think?
[09:16:29] that I'd rather we did not violate our guidelines and rather do it correctly now
[09:16:48] it's not like we need to cut corners
[09:17:21] (CR) Alexandros Kosiaris: [C: -1] "Minor comment, rest LGTM" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/381646 (owner: Hashar)
[09:18:09] (PS5) Elukey: check_prometheus_metric - output error from Prometheus query [puppet] - https://gerrit.wikimedia.org/r/381778 (owner: Ottomata)
[09:19:50] godog: good to merge --^ right ?
[09:20:23] I am trying to figure out how to fix the kafka-jumbo alarms
[09:21:56] <_joe_> akosiaris: ack, I'll do a very small patch and build upon it
[09:23:27] (CR) Alexandros Kosiaris: [C: -1] "If contint2001 is always going to be a "secondary/passive/slave/whatever" with zuul services disabled, it's probably better to just create" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/381647 (owner: Hashar)
[09:24:01] (CR) Alexandros Kosiaris: [C: 1] contint: move an include from site.pp to role [puppet] - https://gerrit.wikimedia.org/r/381648 (owner: Hashar)
[09:26:28] (CR) Alexandros Kosiaris: "Overall looks ok, same comment as for https://gerrit.wikimedia.org/r/#/c/381647/1 for thinking about a "passive/slave/secondary/whatever" " [puppet] - https://gerrit.wikimedia.org/r/381649 (owner: Hashar)
[09:46:36] (CR) Elukey: [C: 2] check_prometheus_metric - output error from Prometheus query [puppet] - https://gerrit.wikimedia.org/r/381778 (owner: Ottomata)
[09:51:20] Operations, ops-eqiad, DBA: Test reliability of RAID configuration/database hosts on single disk failure - https://phabricator.wikimedia.org/T174054#3653239 (Marostegui) a:Marostegui
[09:51:56] Operations, ops-eqiad, DBA: Test reliability of RAID configuration/database hosts on single disk failure - https://phabricator.wikimedia.org/T174054#3549516 (Marostegui) Once depooled I will start mydumper threads to generate some read load to try to simulate a similar environment to what happened to...
[09:53:54] (PS1) Marostegui: mariadb: Update socket location for db1076 [puppet] - https://gerrit.wikimedia.org/r/381950 (https://phabricator.wikimedia.org/T174054)
[09:54:27] (CR) Marostegui: [C: -2] "Wait until the server is depooled on Thursday" [puppet] - https://gerrit.wikimedia.org/r/381950 (https://phabricator.wikimedia.org/T174054) (owner: Marostegui)
[10:03:07] (CR) Hashar: Convert zuul::merger to a profile (2 comments) [puppet] - https://gerrit.wikimedia.org/r/381646 (owner: Hashar)
[10:03:45] (PS2) Hashar: Convert zuul::merger to a profile [puppet] - https://gerrit.wikimedia.org/r/381646
[10:03:59] (CR) jerkins-bot: [V: -1] Convert zuul::merger to a profile [puppet] - https://gerrit.wikimedia.org/r/381646 (owner: Hashar)
[10:06:00] (PS1) Giuseppe Lavagetto: profile::ci::docker: fix exec in production [puppet] - https://gerrit.wikimedia.org/r/381953
[10:06:07] <_joe_> hashar: ^^
[10:07:27] (PS3) Hashar: Convert zuul::merger to a profile [puppet] - https://gerrit.wikimedia.org/r/381646
[10:08:02] (PS1) Elukey: check_prometheus_metric: use [[ ]] around if conditions [puppet] - https://gerrit.wikimedia.org/r/381954
[10:08:39] (CR) Elukey: [C: 2] check_prometheus_metric: use [[ ]] around if conditions [puppet] - https://gerrit.wikimedia.org/r/381954 (owner: Elukey)
[10:09:11] (CR) Hashar: [V: 1] "Rebased and fixed a typo in the hiera key name" [puppet] - https://gerrit.wikimedia.org/r/381646 (owner: Hashar)
[10:12:39] (PS3) Hashar: Convert zuul::server to a profile [puppet] - https://gerrit.wikimedia.org/r/381647
[10:15:03] (CR) Hashar: [V: 1 C: 1] "> If contint2001 is always going to be a "secondary/passive/slave/whatever" with zuul services disabled, it's probably better to just crea" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/381647 (owner: Hashar)
[10:16:03] (CR) Giuseppe Lavagetto: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/8154/" [puppet] - https://gerrit.wikimedia.org/r/381953 (owner: Giuseppe Lavagetto)
[10:16:10] (PS2) Giuseppe Lavagetto: profile::ci::docker: fix exec in production [puppet] - https://gerrit.wikimedia.org/r/381953
[10:17:50] (PS1) Elukey: profile::kafka::broker::monitoring: attempt to fix prometheus alarm [puppet] - https://gerrit.wikimedia.org/r/381955 (https://phabricator.wikimedia.org/T175923)
[10:18:20] (PS2) Elukey: profile::kafka::broker::monitoring: attempt to fix prometheus alarm [puppet] - https://gerrit.wikimedia.org/r/381955 (https://phabricator.wikimedia.org/T175923)
[10:18:53] (PS4) Alexandros Kosiaris: Convert zuul::merger to a profile [puppet] - https://gerrit.wikimedia.org/r/381646 (owner: Hashar)
[10:18:59] (CR) Alexandros Kosiaris: [V: 2 C: 2] Convert zuul::merger to a profile [puppet] - https://gerrit.wikimedia.org/r/381646 (owner: Hashar)
[10:19:15] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[10:19:38] (CR) Elukey: [C: 2] profile::kafka::broker::monitoring: attempt to fix prometheus alarm [puppet] - https://gerrit.wikimedia.org/r/381955 (https://phabricator.wikimedia.org/T175923) (owner: Elukey)
[10:19:46] (PS3) Elukey: profile::kafka::broker::monitoring: attempt to fix prometheus alarm [puppet] - https://gerrit.wikimedia.org/r/381955 (https://phabricator.wikimedia.org/T175923)
[10:20:01] * elukey +2 snipered by akosiaris
[10:20:04] akosiaris: thanks for the suggestion to use a role for each of ci primary vs secondary :)
[10:20:24] akosiaris: I guess I will introduce the new role on role::ci::master is solely depending on profiles
[10:20:51] (PS4) Alexandros Kosiaris: Convert zuul::server to a profile [puppet] - https://gerrit.wikimedia.org/r/381647 (owner: Hashar)
[10:21:05] elukey: :-D
[10:21:15] hashar: ok
[10:21:20] akosiaris: shall I merge your changes too?
[10:21:23] (CR) Alexandros Kosiaris: [V: 2 C: 2] Convert zuul::server to a profile [puppet] - https://gerrit.wikimedia.org/r/381647 (owner: Hashar)
[10:21:28] ah no ok
[10:21:28] :D
[10:21:43] whenever you are ready feel free to merge mine too
[10:21:51] elukey: merged all 3
[10:22:11] \o/
[10:23:02] hashar: https://gerrit.wikimedia.org/r/#/c/381648/ seems to need a manual rebase
[10:23:13] and https://gerrit.wikimedia.org/r/#/c/381649/1 as well
[10:23:50] (PS3) Hashar: contint: move an include from site.pp to role [puppet] - https://gerrit.wikimedia.org/r/381648
[10:24:01] akosiaris: yes :)
[10:24:20] (CR) jerkins-bot: [V: -1] contint: move an include from site.pp to role [puppet] - https://gerrit.wikimedia.org/r/381648 (owner: Hashar)
[10:25:05] bah
[10:25:15] I am hit by the wmf style guide policy :(
[10:25:25] lol
[10:27:25] (PS4) Hashar: contint: move an include from site.pp to role [puppet] - https://gerrit.wikimedia.org/r/381648
[10:29:27] <_joe_> hashar: heh this is because I didn't add the rules for nodes definitions :P
[10:29:30] wtf
[10:29:43] _joe_: yeah the error is definitely legit
[10:29:47] but
[10:30:22] <_joe_> hashar: as soon as I add those, your change would be a zero-sum change
[10:30:29] some oddity: the style guide checkout a branch behind the scene :(
[10:30:38] <_joe_> yes
[10:30:45] <_joe_> it should remove it as well
[10:31:19] <_joe_> hashar: we have to check the delta of violations with the previous commit in the history, for the files involved in the change
[10:31:34] yup that is smart :]
[10:31:48] <_joe_> and I found a big big bug with how we detect git changed files
[10:31:54] <_joe_> we don't consider partial rewrites
[10:32:15] <_joe_> I'm going to have to extend Git even more
[10:32:26] (PS2) Hashar: contint: move jenkins from role to a profile [puppet] - https://gerrit.wikimedia.org/r/381649
[10:33:04] (CR) jerkins-bot: [V: -1] contint: move jenkins from role to a profile [puppet] - https://gerrit.wikimedia.org/r/381649 (owner: Hashar)
[10:33:11] (PS5) Hashar: contint: move an include from site.pp to role [puppet] - https://gerrit.wikimedia.org/r/381648
[10:34:14] _joe_: https://gerrit.wikimedia.org/r/381649 that one fails for me locally and on CI with some puppetlint nocode error :(
[10:34:14] <_joe_> hashar: I think you triggered some bug with that change
[10:34:27] <_joe_> yeah I have to understand what's wrong there :)
[10:34:39] and I end up in the style checked out branch which I guess is the parent commit
[10:34:45] <_joe_> hashar: we can make the job non-voting while I figure that out
[10:34:53] <_joe_> hashar: oh I see
[10:35:13] (CR) Hashar: [V: 1 C: 1] "I have introduced a dummy profile::ci::firewall class." [puppet] - https://gerrit.wikimedia.org/r/381648 (owner: Hashar)
[10:35:35] _joe_: I would keep it around. That is actually super useful for refactoring to profile!
[10:35:50] <_joe_> ok then wait for me :P
[10:35:55] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[10:36:15] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[10:36:20] <_joe_> hashar: oh I know what's wrong
[10:36:27] <_joe_> how asinine of me :/
[10:36:37] <_joe_> can someone look @ulsfo?
[10:36:44] <_joe_> it's strange it alarms at this time of the day
[10:36:51] checking ulsfo
[10:37:24] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[10:37:51] big spike, not related to a single backend afaics, should be gone now
[10:39:24] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[10:41:55] !log mobrovac@tin Started deploy [restbase/deploy@1cc530b]: (no justification provided)
[10:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:03] I am trying to figure out why I am still seeing The command defined for service Kafka Broker Replica Max Lag does not exist
[10:42:19] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:42:19] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:42:28] (PS3) Giuseppe Lavagetto: contint: move jenkins from role to a profile [puppet] - https://gerrit.wikimedia.org/r/381649 (owner: Hashar)
[10:42:29] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:42:30] (PS1) Giuseppe Lavagetto: Rakefile: various fixes [puppet] - https://gerrit.wikimedia.org/r/381956
[10:42:33] I mean, I can see that it is defined in /etc/icing/etc.. on einstenium
[10:43:04] (CR) jerkins-bot: [V: -1] Rakefile: various fixes [puppet] - https://gerrit.wikimedia.org/r/381956 (owner: Giuseppe Lavagetto)
[10:43:25] <_joe_> hashar: https://integration.wikimedia.org/ci/job/operations-puppet-tests-docker/6460/console
[10:43:37] <_joe_> funnily, my change has something wrong with it :P
[10:43:51] ;D
[10:44:13] Operations, Goal: Improve database backups' coverage, monitoring and data recovery time (part 1) (tracking) - https://phabricator.wikimedia.org/T169658#3653311 (jcrespo) Open>Resolved a:jcrespo
[10:44:52] (PS2) Giuseppe Lavagetto: Rakefile: various fixes [puppet] - https://gerrit.wikimedia.org/r/381956
[10:47:22] (CR) Filippo Giunchedi: [C: 2] base: check remote_syslog and remote_syslog_tls for emptyness [puppet] - https://gerrit.wikimedia.org/r/381944 (https://phabricator.wikimedia.org/T136312) (owner: Filippo Giunchedi)
[10:49:04] (PS1) Elukey: check_prometheus_metrics.cfg: correct command name [puppet] - https://gerrit.wikimedia.org/r/381957 (https://phabricator.wikimedia.org/T175923)
[10:49:46] (CR) Filippo Giunchedi: [C: 1] check_prometheus_metrics.cfg: correct command name [puppet] - https://gerrit.wikimedia.org/r/381957 (https://phabricator.wikimedia.org/T175923) (owner: Elukey)
[10:49:57] (PS2) Filippo Giunchedi: base: check remote_syslog and remote_syslog_tls for emptyness [puppet] - https://gerrit.wikimedia.org/r/381944 (https://phabricator.wikimedia.org/T136312)
[10:49:59] (PS5) Filippo Giunchedi: hieradata: enable syslog over tls for ulsfo [puppet] - https://gerrit.wikimedia.org/r/381768 (https://phabricator.wikimedia.org/T136312)
[10:50:10] (CR) Elukey: [C: 2] check_prometheus_metrics.cfg: correct command name [puppet] - https://gerrit.wikimedia.org/r/381957 (https://phabricator.wikimedia.org/T175923) (owner: Elukey)
[10:50:33] !log mobrovac@tin Finished deploy [restbase/deploy@1cc530b]: (no justification provided) (duration: 08m 38s)
[10:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:47] (PS3) Filippo Giunchedi: base: check remote_syslog and remote_syslog_tls for emptyness [puppet] - https://gerrit.wikimedia.org/r/381944 (https://phabricator.wikimedia.org/T136312)
[10:51:16] (PS3) Giuseppe Lavagetto: Rakefile: various fixes [puppet] - https://gerrit.wikimedia.org/r/381956
[10:51:55] (PS6) Filippo Giunchedi: hieradata: enable syslog over tls for ulsfo [puppet] - https://gerrit.wikimedia.org/r/381768 (https://phabricator.wikimedia.org/T136312)
[10:52:25] (PS4) Giuseppe Lavagetto: Rakefile: various fixes [puppet] - https://gerrit.wikimedia.org/r/381956
[10:52:29] <_joe_> ahhh rebase war!
[10:52:42] (CR) Giuseppe Lavagetto: [C: 2] Rakefile: various fixes [puppet] - https://gerrit.wikimedia.org/r/381956 (owner: Giuseppe Lavagetto)
[10:52:49] (CR) Filippo Giunchedi: [C: 2] hieradata: enable syslog over tls for ulsfo [puppet] - https://gerrit.wikimedia.org/r/381768 (https://phabricator.wikimedia.org/T136312) (owner: Filippo Giunchedi)
[10:53:06] <_joe_> godog: tell me when you're done
[10:53:33] doh, sorry I wasn't looking
[10:53:36] _joe_: done
[10:53:37] <_joe_> ahah
[10:53:38] <_joe_> ok
[10:53:50] (PS5) Giuseppe Lavagetto: Rakefile: various fixes [puppet] - https://gerrit.wikimedia.org/r/381956
[10:57:46] hi, there seems to be some puppet failures.
[10:57:47] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: empty(): Requires either array, hash, string or integer to work with at /etc/puppet/modules/profile/manifests/base.pp:53 on node puppet-paladox3.git.eqiad.wmflabs [10:57:47] Warning: Not using cache on failed catalog [10:57:47] Error: Could not retrieve catalog; skipping run [10:59:05] line leads to [10:59:05] unless empty($remote_syslog) and empty($remote_syslog_tls) { [10:59:36] looking at https://gerrit.wikimedia.org/r/#/c/381944/3/modules/profile/manifests/base.pp , it seems to be caused by that change. [10:59:51] Filippo is rolling out a change for tls remote syslog (see above) [10:59:56] (03CR) 10Giuseppe Lavagetto: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/381295 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [11:00:10] (03CR) 10jerkins-bot: [V: 04-1] openstack: pdns auth module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/381295 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [11:00:35] paladox: yeah that's the change, though both variables are [] if not found in hiera [11:00:46] (03CR) 10Hashar: [V: 031 C: 031] "I compile" [puppet] - 10https://gerrit.wikimedia.org/r/381649 (owner: 10Hashar) [11:00:48] oh. [11:01:08] (03CR) 10Giuseppe Lavagetto: "Meh, sorry, I wanted to use this change as a testbed for the new rake tests." [puppet] - 10https://gerrit.wikimedia.org/r/381295 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [11:01:37] _joe_: congratulations on the bug fix! :] [11:01:54] akosiaris: and the jenkins role -> profile change is a noop https://gerrit.wikimedia.org/r/#/c/381649/ :] [11:02:15] it seems to be failing on two puppet masters, with the same error. [11:03:42] PROBLEM - puppet last run on lithium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:07:20] it seems it is failing on labs. 
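For context on the `empty()` failure quoted above, here is a minimal Ruby sketch. It assumes the check behaves as the error message describes (only arrays, hashes, strings or integers are accepted) and that one of the two hiera values resolved to something outside those types; the log does not say which value, so `nil` is used purely for illustration. `puppet_empty`, `syslog_configured?` and the host name are hypothetical stand-ins, not the stdlib or puppet repo code.

```ruby
# Hypothetical stand-in for the type check implied by the error message;
# this is NOT the real stdlib implementation.
def puppet_empty(value)
  unless [Array, Hash, String, Integer].any? { |type| value.is_a?(type) }
    raise ArgumentError,
          'empty(): Requires either array, hash, string or integer to work with'
  end
  # In this sketch an integer is never treated as empty.
  value.is_a?(Integer) ? false : value.empty?
end

# The guard from base.pp ("unless empty(a) and empty(b)"), restated in Ruby.
def syslog_configured?(remote_syslog, remote_syslog_tls)
  !(puppet_empty(remote_syslog) && puppet_empty(remote_syslog_tls))
end

# Both lookups fall back to [] : nothing to configure.
puts syslog_configured?([], [])

# One list is populated (made-up host name): the guarded block would run.
puts syslog_configured?([], ['syslog.example.org:6514'])

# A value of an unsupported type aborts the whole evaluation.
begin
  syslog_configured?(nil, [])
rescue ArgumentError => e
  puts "catalog compile would fail: #{e.message}"
end
```

Under these assumptions, making the hieradata value an empty list (as the follow-up change title suggests) keeps the guard on the supported-type path.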
[11:10:23] paladox: yeah I think I found what's wrong, fixing [11:10:29] thanks. [11:13:03] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 180051.43 seconds [11:13:34] (03PS1) 10Filippo Giunchedi: hieradata: use empty list for remote_syslog [puppet] - 10https://gerrit.wikimedia.org/r/381959 (https://phabricator.wikimedia.org/T136312) [11:14:39] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/381959 (https://phabricator.wikimedia.org/T136312) (owner: 10Filippo Giunchedi) [11:15:11] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: use empty list for remote_syslog [puppet] - 10https://gerrit.wikimedia.org/r/381959 (https://phabricator.wikimedia.org/T136312) (owner: 10Filippo Giunchedi) [11:15:13] thanks volans ! [11:15:18] yw :) [11:16:40] paladox: should be recovering on the next puppet run [11:16:46] thanks :). [11:16:52] RECOVERY - puppet last run on lithium is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [11:17:07] yep recovers for me now :). [12:11:20] (03PS1) 10Elukey: monitoring::check_prometheus: fix ordering of arguments [puppet] - 10https://gerrit.wikimedia.org/r/381965 (https://phabricator.wikimedia.org/T175923) [12:12:48] (03CR) 10Hashar: "Comes from d46c85895107e79bb510a6cd98dec06b8e81a86c by Chase. Maybe /utils/localrun can be removed as well?" 
[puppet] - 10https://gerrit.wikimedia.org/r/381935 (owner: 10Giuseppe Lavagetto) [12:12:52] (03CR) 10Hashar: [C: 031] utils: remove `linter` script [puppet] - 10https://gerrit.wikimedia.org/r/381935 (owner: 10Giuseppe Lavagetto) [12:13:23] (03CR) 10Hashar: [C: 031] gitignore: add standard directory for bundle creation [puppet] - 10https://gerrit.wikimedia.org/r/381934 (owner: 10Giuseppe Lavagetto) [12:17:12] (03CR) 10Elukey: [C: 032] monitoring::check_prometheus: fix ordering of arguments [puppet] - 10https://gerrit.wikimedia.org/r/381965 (https://phabricator.wikimedia.org/T175923) (owner: 10Elukey) [12:21:18] (03CR) 10Hashar: [C: 04-1] "Requires the table "shorturl" to be created in the database." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381714 (https://phabricator.wikimedia.org/T177187) (owner: 10Jayprakash12345) [12:22:00] (03CR) 10Hashar: [C: 031] Enable NewUserMessage Extension on hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381715 (https://phabricator.wikimedia.org/T177188) (owner: 10Jayprakash12345) [12:28:08] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is OK: OK - kafka_broker_under_replicated_partitions is 0 [12:28:09] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1006 is OK: OK - kafka_broker_replica_max_lag is 0 [12:28:29] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1003 is OK: OK - kafka_broker_replica_max_lag is 0 [12:29:20] * elukey dances [12:33:02] (03CR) 10Hashar: [C: 031] "Bah it is already there" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381714 (https://phabricator.wikimedia.org/T177187) (owner: 10Jayprakash12345) [12:41:19] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is OK: OK - kafka_broker_under_replicated_partitions is 0 [12:41:59] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1002 is OK: OK - kafka_broker_replica_max_lag is 0 [12:41:59] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is OK: OK 
- kafka_broker_under_replicated_partitions is 0 [12:41:59] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is OK: OK - kafka_broker_under_replicated_partitions is 0 [12:42:10] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1004 is OK: OK - kafka_broker_replica_max_lag is 0 [12:42:30] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1005 is OK: OK - kafka_broker_replica_max_lag is 0 [12:42:39] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is OK: OK - kafka_broker_under_replicated_partitions is 0 [12:43:03] (03PS5) 10Jayprakash12345: Enable ShortUrl Extension on hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381714 (https://phabricator.wikimedia.org/T177187) [12:43:45] (03PS5) 10Jayprakash12345: Enable NewUserMessage Extension on hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381715 (https://phabricator.wikimedia.org/T177188) [12:44:00] (03CR) 10Hashar: [C: 04-1] "Stretch will be in containers. Currently we provision them with a Dockerfile / shell and most probably we are not going to use puppet anym" [puppet] - 10https://gerrit.wikimedia.org/r/361680 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [12:44:26] (03CR) 10Hashar: [C: 031] "(but we can still get this merged so Paladox can experiment with stretch :] )." [puppet] - 10https://gerrit.wikimedia.org/r/361680 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [12:44:58] (03CR) 10Paladox: "thanks :)." 
[puppet] - 10https://gerrit.wikimedia.org/r/361680 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [12:46:00] (03PS1) 10Elukey: Add mw videoscaler hiera config for the new eqiad hosts [puppet] - 10https://gerrit.wikimedia.org/r/381969 (https://phabricator.wikimedia.org/T165519) [12:49:55] (03PS2) 10Elukey: Add mw videoscaler hiera config for the new eqiad hosts [puppet] - 10https://gerrit.wikimedia.org/r/381969 (https://phabricator.wikimedia.org/T165519) [12:55:29] (03PS1) 10Filippo Giunchedi: hieradata: enable syslog over tls for codfw [puppet] - 10https://gerrit.wikimedia.org/r/381970 (https://phabricator.wikimedia.org/T136312) [12:55:55] volans: ^ if you spare a minute [12:56:03] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2044" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381971 [12:56:13] (03CR) 10jerkins-bot: [V: 04-1] Revert "db-codfw.php: Depool db2044" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381971 (owner: 10Marostegui) [12:56:19] (03Abandoned) 10Marostegui: Revert "db-codfw.php: Depool db2044" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381971 (owner: 10Marostegui) [12:56:30] godog: have you already done esams too? [12:56:45] (just wondering on the order of the DCs) [12:57:32] volans: not esams no, I decided to go "eastbound" since ulsfo didn't have much impact on the syslog servers and esams isn't that big [12:57:44] ack [12:57:48] (03CR) 10Ottomata: "YES THANK YOU!" 
[puppet] - 10https://gerrit.wikimedia.org/r/381965 (https://phabricator.wikimedia.org/T175923) (owner: 10Elukey) [12:57:52] (03PS1) 10Marostegui: db-codfw.php: Repool db2044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381972 (https://phabricator.wikimedia.org/T174764) [12:58:03] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/381970 (https://phabricator.wikimedia.org/T136312) (owner: 10Filippo Giunchedi) [12:58:17] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: enable syslog over tls for codfw [puppet] - 10https://gerrit.wikimedia.org/r/381970 (https://phabricator.wikimedia.org/T136312) (owner: 10Filippo Giunchedi) [12:58:27] nice, thanks [12:59:54] thank you for taking care of it! [13:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the European Mid-day SWAT(Max 8 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171003T1300). [13:00:05] Jayprakash12345, greg-g, dcausse, and kart_: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:12] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381972 (https://phabricator.wikimedia.org/T174764) (owner: 10Marostegui) [13:00:14] I can SWAT today [13:00:22] zeljkof: going to deploy db-codfw.php right now [13:00:26] give me a minute :) [13:00:35] marostegui: sure, ping me when I can start [13:00:42] sure, waiting for the merge now [13:00:50] o/ [13:00:58] ok zeljkof [13:02:30] * kart_ is around [13:03:17] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381972 (https://phabricator.wikimedia.org/T174764) (owner: 10Marostegui) [13:03:31] (03CR) 10jenkins-bot: db-codfw.php: Repool db2044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381972 (https://phabricator.wikimedia.org/T174764) (owner: 10Marostegui) [13:04:20] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2044 - T174764 (duration: 00m 47s) [13:04:20] zeljkof: I am done, all yours! 
Thank you for waiting [13:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:27] T174764: db2044 HW RAID failure - https://phabricator.wikimedia.org/T174764 [13:04:36] marostegui: no problem :) [13:04:51] (03PS1) 10Giuseppe Lavagetto: Rakefile: enhance detection of file renames [puppet] - 10https://gerrit.wikimedia.org/r/381975 [13:04:59] Jayprakash12345: reviewing your commits, will ping you in a few minutes when the first one is at mwdebug1002 [13:05:18] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381715 (https://phabricator.wikimedia.org/T177188) (owner: 10Jayprakash12345) [13:07:13] (03CR) 10Zfilipin: [C: 031] Enable ShortUrl Extension on hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381714 (https://phabricator.wikimedia.org/T177187) (owner: 10Jayprakash12345) [13:07:25] 10Operations, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review, 10User-Elukey: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3653443 (10elukey) 05Open>03Resolved [13:07:56] greg-g: around for EU SWAT? you have scheduled 378784 [13:08:26] (03Merged) 10jenkins-bot: Enable NewUserMessage Extension on hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381715 (https://phabricator.wikimedia.org/T177188) (owner: 10Jayprakash12345) [13:08:35] (03CR) 10jenkins-bot: Enable NewUserMessage Extension on hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381715 (https://phabricator.wikimedia.org/T177188) (owner: 10Jayprakash12345) [13:10:18] (03CR) 10Zfilipin: "greg-g and Zoranzoki21 were not around for EU SWAT on October 3, please reschedule for another SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378784 (https://phabricator.wikimedia.org/T176174) (owner: 10Greg Grossmeier) [13:11:00] PROBLEM - puppet last run on wtp1038 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:11:41] Jayprakash1245: 381715 is at mwdebug1002, please test and let me know if I can deploy [13:12:40] everthing is fine. [13:12:56] Jayprakash1245: ok, deploying [13:13:03] You can deploy [13:13:09] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/381975 (owner: 10Giuseppe Lavagetto) [13:13:14] 10Operations, 10ORES, 10Graphite, 10Scoring-platform-team (Current), 10User-fgiunchedi: Regularly purge old ores graphite metrics - https://phabricator.wikimedia.org/T169969#3653457 (10fgiunchedi) @awight I shared the list with Amir on T174542, happy to share it with you too (4MB gz file, 500k lines). I... [13:13:23] (03CR) 10Zfilipin: [C: 031] Adjust wgNamespacesToBeSearchedDefault for enwikibooks, fawikibooks and hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381766 (https://phabricator.wikimedia.org/T176906) (owner: 10DCausse) [13:13:56] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:381715|Enable NewUserMessage Extension on hiwikiversity (T177188)]] (duration: 00m 46s) [13:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:01] T177188: Enable NewUserMessage Extension on hiwikiversity - https://phabricator.wikimedia.org/T177188 [13:14:06] dcausse: can you please paste the exact script (with parameters) that needs to be run for 381766? [13:14:23] Jayprakash1245: deployed, please check [13:14:36] (03PS6) 10Zfilipin: Enable ShortUrl Extension on hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381714 (https://phabricator.wikimedia.org/T177187) (owner: 10Jayprakash12345) [13:14:40] zeljkof: it's using custom elastic apis not mwscript [13:14:51] (03CR) 10Giuseppe Lavagetto: [C: 032] Rakefile: enhance detection of file renames [puppet] - 10https://gerrit.wikimedia.org/r/381975 (owner: 10Giuseppe Lavagetto) [13:14:55] dcausse: are you going to run the script after the deploy? 
[13:15:07] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381714 (https://phabricator.wikimedia.org/T177187) (owner: 10Jayprakash12345) [13:15:10] zeljkof: terbium.eqiad.wmnet:/home/dcausse/reindex_ns_381766.sh [13:15:14] zeljkof: yes I'll run it [13:15:24] dcausse: cool, will ping you when done [13:15:28] thanks [13:15:52] zeljkof: yah succefully Deployed. [13:16:14] Jayprakash1245: great, I am merging the second patch, will ping you in a minute when it's at mwdebug [13:17:55] (03Merged) 10jenkins-bot: Enable ShortUrl Extension on hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381714 (https://phabricator.wikimedia.org/T177187) (owner: 10Jayprakash12345) [13:18:05] (03CR) 10jenkins-bot: Enable ShortUrl Extension on hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381714 (https://phabricator.wikimedia.org/T177187) (owner: 10Jayprakash12345) [13:18:48] (03PS2) 10Giuseppe Lavagetto: gitignore: add standard directory for bundle creation [puppet] - 10https://gerrit.wikimedia.org/r/381934 [13:19:01] Jayprakash1245: 381714 is at mwdebug1002, please test and let me know if I can continue [13:19:18] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is OK: OK - kafka_broker_under_replicated_partitions is 0 [13:19:30] (03CR) 10Giuseppe Lavagetto: [C: 032] gitignore: add standard directory for bundle creation [puppet] - 10https://gerrit.wikimedia.org/r/381934 (owner: 10Giuseppe Lavagetto) [13:19:32] (03PS3) 10Zfilipin: Adjust wgNamespacesToBeSearchedDefault for enwikibooks, fawikibooks and hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381766 (https://phabricator.wikimedia.org/T176906) (owner: 10DCausse) [13:19:51] (03PS2) 10Giuseppe Lavagetto: utils: remove `linter` script [puppet] - 10https://gerrit.wikimedia.org/r/381935 [13:20:58] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1001 is OK: OK - kafka_broker_replica_max_lag is 0 [13:21:10] 
backticks, _joe_? :) [13:21:21] if it was `linter' I'd attribute it to your PhD past ;) [13:21:25] 10Operations, 10Contributors-Team, 10MobileFrontend, 10wikidiff2, and 2 others: Diff page consistently produces 503 on beta cluster on first visit - https://phabricator.wikimedia.org/T176637#3653488 (10jkroll) @MaxSem: Jkroll [13:22:02] zeljkof: ping? [13:22:10] Dereckson: yes? [13:22:11] zeljkof: you can deploy https://gerrit.wikimedia.org/r/#/c/378784/ I can test it [13:22:30] Dereckson: deal! :) [13:22:43] (03PS2) 10Zfilipin: Revert "Limit thanks for new users at pl.wikipedia to 3 per day" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378784 (https://phabricator.wikimedia.org/T176174) (owner: 10Greg Grossmeier) [13:24:08] zeljkof: why not +2? :) [13:25:10] kart_: I usually +1 commits that I have reviewed and +2 before the deployment, but since your is the only one in an extension, you are correct, I can +2 immediately :) [13:25:20] !log Optimize templatelinks and pagelinks tables on s1, s4 and s6 on labsdb1010 - T174509 [13:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:29] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [13:26:33] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:381714|Enable ShortUrl Extension on hiwikiversity (T177187)]] (duration: 00m 46s) [13:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:40] T177187: Enable ShortUrl Extension on hiwikiversity - https://phabricator.wikimedia.org/T177187 [13:27:57] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378784 (https://phabricator.wikimedia.org/T176174) (owner: 10Greg Grossmeier) [13:28:29] Dereckson: merging 378784, will ping you when it's at mwdebug1002, should take a minute or two [13:29:03] oki [13:29:39] Yah Everthing is ok [13:29:47] (03CR) 10Zfilipin: [C: 032] 
"Dereckson volunteered to babysit the deploy." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378784 (https://phabricator.wikimedia.org/T176174) (owner: 10Greg Grossmeier) [13:30:03] Deploy (https://hi.wikiversity.org/s/2) [13:30:13] (03Merged) 10jenkins-bot: Revert "Limit thanks for new users at pl.wikipedia to 3 per day" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378784 (https://phabricator.wikimedia.org/T176174) (owner: 10Greg Grossmeier) [13:30:21] Jayprakash12345: great, I have deployed it, please check and thanks for deploying with #releng ;) [13:30:27] (03CR) 10jenkins-bot: Revert "Limit thanks for new users at pl.wikipedia to 3 per day" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378784 (https://phabricator.wikimedia.org/T176174) (owner: 10Greg Grossmeier) [13:31:12] Dereckson: 378784 is at mwdebug1002, please test and let me know if I can continue [13:31:41] (03PS4) 10Zfilipin: Adjust wgNamespacesToBeSearchedDefault for enwikibooks, fawikibooks and hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381766 (https://phabricator.wikimedia.org/T176906) (owner: 10DCausse) [13:32:32] So anything else for me? [13:32:46] So anything else for me? [13:33:14] Jayprakash12345: I think that's all, you had two commits right? both are deployed [13:33:32] zeljkof: thanks extension still works fine [13:33:42] yes so I quiet [13:33:51] and limits through mwrepl looks good, so yes, you can deploy :) [13:33:52] Dereckson: ok, good to deploy? 
[13:33:54] (03PS1) 10Elukey: Enable basic ACL handling on the Kafka Jumbo cluster [puppet] - 10https://gerrit.wikimedia.org/r/381980 (https://phabricator.wikimedia.org/T173493) [13:33:57] * Dereckson nods [13:33:58] ok, deploying [13:34:26] (03CR) 10jerkins-bot: [V: 04-1] Enable basic ACL handling on the Kafka Jumbo cluster [puppet] - 10https://gerrit.wikimedia.org/r/381980 (https://phabricator.wikimedia.org/T173493) (owner: 10Elukey) [13:34:55] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:378784|Revert "Limit thanks for new users at pl.wikipedia to 3 per day" (T176174 T169268)]] (duration: 00m 46s) [13:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:01] T176174: Revert Thanks limit on pl.wp for new users - https://phabricator.wikimedia.org/T176174 [13:35:01] T169268: Limiting thanks for new users at pl.wikipedia - https://phabricator.wikimedia.org/T169268 [13:35:13] Dereckson: deployed, please check [13:35:38] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381766 (https://phabricator.wikimedia.org/T176906) (owner: 10DCausse) [13:37:14] (03Merged) 10jenkins-bot: Adjust wgNamespacesToBeSearchedDefault for enwikibooks, fawikibooks and hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381766 (https://phabricator.wikimedia.org/T176906) (owner: 10DCausse) [13:37:21] dcausse: please stand by, merging 381766, will ping you in 1-2 minutes when it's at mwdebug1002 [13:37:24] (03CR) 10jenkins-bot: Adjust wgNamespacesToBeSearchedDefault for enwikibooks, fawikibooks and hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381766 (https://phabricator.wikimedia.org/T176906) (owner: 10DCausse) [13:38:12] it is a bit difficult (at least for me) to get what wmf guidelines is wrong from https://integration.wikimedia.org/ci/job/operations-puppet-tests-docker/6478/console [13:38:18] RECOVERY - puppet last run on wtp1038 is OK: OK: Puppet 
is currently enabled, last run 42 seconds ago with 0 failures [13:38:47] dcausse: 381766 is at mwdebug1002, please test and let me know if I can deploy [13:38:55] zeljkof: testing [13:39:02] 13:34:24 ArgumentError: wrong number of arguments (0 for 1) [13:39:02] 13:34:24 /tmp/cache/puppet/Rakefile:200:in `linter_problems' [13:39:02] 13:34:24 /tmp/cache/puppet/Rakefile:281:in `block in setup_wmf_styleguide_delta' [13:39:06] elukey ^^ [13:39:46] paladox: thanks but it still feels a bit cryptic [13:39:47] :D [13:39:56] your welcome :). [13:40:29] * elukey gets coffee [13:40:36] zeljkof: looks good to me [13:40:43] dcausse: ok, deploying [13:41:32] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:381766|Adjust wgNamespacesToBeSearchedDefault for enwikibooks, fawikibooks and hewikisource (T176906 T176908 T176907)]] (duration: 00m 46s) [13:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:39] T176906: Special:Search—add additional namespaces to search results for English Wikibooks - https://phabricator.wikimedia.org/T176906 [13:41:40] T176907: Special:Search—add additional namespaces to search results for Hebrew Wikisource - https://phabricator.wikimedia.org/T176907 [13:41:40] T176908: Special:Search—add additional namespaces to search results for Persian Wikibooks - https://phabricator.wikimedia.org/T176908 [13:42:01] dcausse: deployed, please check, run the scripts, monitor logs... and thanks for deploying with #releng ;) [13:42:10] zeljkof: thanks! 
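The `ArgumentError` in the CI trace above is what Ruby raises when a method declaring one required parameter is called with none. A minimal sketch follows, assuming a hypothetical one-argument `linter_problems`; the real Rakefile method and its call site are not shown in this log, so the body here is invented for illustration only.

```ruby
# Hypothetical method with one required parameter; the real
# linter_problems in the Rakefile is not reproduced here.
def linter_problems(files)
  files.select { |f| f.end_with?('.pp') }
end

begin
  linter_problems() # called without its argument, as the trace suggests
rescue ArgumentError => e
  # Older Rubies word this "(0 for 1)", newer ones "(given 0, expected 1)".
  puts "#{e.class}: #{e.message}"
end

# Passing the expected argument works.
puts linter_problems(['base.pp', 'README']).inspect
```

The backtrace points at the method definition first and the caller second, which matches the two Rakefile frames quoted in the log.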
[13:42:42] kart_: please stand by, will ping you in a minute when the patch is at mwdebug1002 [13:42:55] zeljkof: sure [13:45:49] kart_: 381967 is at mwdebug1002, please check and let me know if I can deploy [13:46:44] !log upgrade grafana to 4.5.2 on krypton - T175980 [13:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:50] T175980: Upgrade grafana to 4.5.2 - https://phabricator.wikimedia.org/T175980 [13:46:52] gah, sorry that was in the middle of swat [13:47:12] shouldn't be impacting though [13:47:15] let me know if it is [13:47:52] godog: will do :) [13:48:13] thanks zeljkof ! [13:49:17] (03PS1) 10Alexandros Kosiaris: raid: Increase retry interval to 10 mins [puppet] - 10https://gerrit.wikimedia.org/r/381981 (https://phabricator.wikimedia.org/T173311) [13:49:47] (03CR) 10jerkins-bot: [V: 04-1] raid: Increase retry interval to 10 mins [puppet] - 10https://gerrit.wikimedia.org/r/381981 (https://phabricator.wikimedia.org/T173311) (owner: 10Alexandros Kosiaris) [13:51:06] kart_: do you need more time to test? [13:51:27] zeljkof: give a minute. [13:51:39] kart_: sure, just checking [13:54:18] zeljkof: looks good. Go ahead. [13:54:26] kart_: deploying [13:54:35] same error for akosiaris afaics, the check might have issues? [13:54:39] Cc hashar :) [13:54:59] with the wmf style guide ? [13:55:16] !log zfilipin@tin Synchronized php-1.31.0-wmf.1/extensions/Translate/webservices/CxserverWebService.php: SWAT: [[gerrit:381967|Make CxserverWebService forward compatible]] (duration: 00m 45s) [13:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:24] 00:00:14.037 ArgumentError: wrong number of arguments (0 for 1) [13:55:24] 00:00:14.038 /tmp/cache/puppet/Rakefile:200:in `linter_problems' [13:55:24] 00:00:14.038 /tmp/cache/puppet/Rakefile:281:in `block in setup_wmf_styleguide_delta' [13:55:33] _joe_: some more oddity! 
^ :D [13:55:55] kart_: deployed, please check and thanks for deploying with #releng ;) [13:56:06] <_joe_> uhm [13:56:14] <_joe_> yeah that's from my last patch I guess [13:56:14] !log EU SWAT finished [13:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:24] <_joe_> fixing asap [13:56:32] (03PS1) 10Alexandros Kosiaris: check_puppetrun: Execute less often [puppet] - 10https://gerrit.wikimedia.org/r/381982 (https://phabricator.wikimedia.org/T173427) [13:56:42] (03PS2) 10Elukey: Enable basic ACL handling on the Kafka Jumbo cluster [puppet] - 10https://gerrit.wikimedia.org/r/381980 (https://phabricator.wikimedia.org/T173493) [13:56:46] 10Operations, 10monitoring, 10User-fgiunchedi: Upgrade grafana to 4.5.2 - https://phabricator.wikimedia.org/T175980#3653596 (10fgiunchedi) 05Open>03Resolved All done! [13:57:13] (03CR) 10jerkins-bot: [V: 04-1] Enable basic ACL handling on the Kafka Jumbo cluster [puppet] - 10https://gerrit.wikimedia.org/r/381980 (https://phabricator.wikimedia.org/T173493) (owner: 10Elukey) [13:57:17] (03CR) 10jerkins-bot: [V: 04-1] check_puppetrun: Execute less often [puppet] - 10https://gerrit.wikimedia.org/r/381982 (https://phabricator.wikimedia.org/T173427) (owner: 10Alexandros Kosiaris) [13:57:36] <_joe_> yeah guys my fault [13:57:37] <_joe_> fixing [13:57:40] _joe_: - diff.name_status.select { |_, status| 'ACM'.include? 
status}.keys lacked R I am not sure why though [13:57:51] <_joe_> hashar: it does not, that's ok [13:58:00] <_joe_> the error is in anoher thing [13:59:36] (03PS1) 10Giuseppe Lavagetto: Rakefile: fix call to linter_problems [puppet] - 10https://gerrit.wikimedia.org/r/381986 [14:00:00] <_joe_> hashar: 'R' tells you only the name of the old file in name_status [14:00:19] <_joe_> akosiaris, elukey rebase onto my change [14:00:22] (03CR) 10Elukey: "pcc https://puppet-compiler.wmflabs.org/compiler02/8160/kafka-jumbo1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/381980 (https://phabricator.wikimedia.org/T173493) (owner: 10Elukey) [14:00:27] (03CR) 10Giuseppe Lavagetto: [C: 032] Rakefile: fix call to linter_problems [puppet] - 10https://gerrit.wikimedia.org/r/381986 (owner: 10Giuseppe Lavagetto) [14:00:58] _joe_ ack thanks! [14:00:58] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/381982 (https://phabricator.wikimedia.org/T173427) (owner: 10Alexandros Kosiaris) [14:01:22] <_joe_> hashar: unless he rebases, that's never gonna pass [14:01:51] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/381980 (https://phabricator.wikimedia.org/T173493) (owner: 10Elukey) [14:02:27] _joe_: CI first merge the patch against the tip of the branch :) [14:02:34] that fixed both \o/ [14:03:03] zeljkof: Thanks a lot! 
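On the `name_status` filter discussed above, a minimal sketch of what the quoted `select` keeps. It assumes `name_status` is a hash of path to single-letter git status, as the `|_, status|` block and `.keys` call suggest; the paths below are made up.

```ruby
# Hypothetical data shaped like the hash the quoted snippet iterates over:
# path => git status letter (A=added, C=copied, M=modified,
# R=renamed, D=deleted).
name_status = {
  'modules/base/manifests/init.pp'  => 'M',
  'modules/new/manifests/init.pp'   => 'A',
  'modules/old/manifests/gone.pp'   => 'D',
  'modules/old/manifests/moved.pp'  => 'R'
}

# Keep only added, copied or modified paths, as in the quoted one-liner.
changed = name_status.select { |_, status| 'ACM'.include?(status) }.keys

puts changed
```

If, as _joe_ says, an `R` entry carries only the old file name, leaving `R` out of the filter avoids handing a path that no longer exists to later steps.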
[14:05:08] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [14:05:08] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [14:07:22] 10Operations, 10monitoring: Fix permissions for systemd file - https://phabricator.wikimedia.org/T155869#3653608 (10akosiaris) 05Open>03Resolved Fixed since rOPUPcef2cb7ea58e85cbe54ac5, resolving [14:08:09] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [14:08:15] those are two spikes between 14:00 and 14:03, already recovered [14:12:58] (03CR) 10Giuseppe Lavagetto: [C: 032] utils: remove `linter` script [puppet] - 10https://gerrit.wikimedia.org/r/381935 (owner: 10Giuseppe Lavagetto) [14:12:59] (03PS1) 10Giuseppe Lavagetto: profile::docker::registry: conform to current style guide [puppet] - 10https://gerrit.wikimedia.org/r/381987 [14:13:02] 10Operations, 10Traffic: cp3032 ethernet link down (bnx2x dump in the dmesg) - https://phabricator.wikimedia.org/T166758#3653616 (10BBlack) 05Open>03Resolved a:03BBlack Hasn't recurred AFAIK. Note this is similar to bnx2x dmesg we managed to induce on a bunch of upload@ulsfo machines via bad NUMA tuning... 
[14:13:04] (03PS3) 10Giuseppe Lavagetto: utils: remove `linter` script [puppet] - 10https://gerrit.wikimedia.org/r/381935 [14:13:50] 10Operations, 10ops-esams, 10hardware-requests: decom cp3011-22 (12 machines) - https://phabricator.wikimedia.org/T130883#3653621 (10BBlack) [14:13:52] 10Operations, 10ops-esams, 10Traffic: cp3021 failed disk sdb - https://phabricator.wikimedia.org/T148983#3653623 (10BBlack) [14:14:16] (03PS2) 10Jcrespo: raid: Increase retry interval to 10 mins [puppet] - 10https://gerrit.wikimedia.org/r/381981 (https://phabricator.wikimedia.org/T173311) (owner: 10Alexandros Kosiaris) [14:16:03] (03PS3) 10Elukey: Enable basic ACL handling on the Kafka Jumbo cluster [puppet] - 10https://gerrit.wikimedia.org/r/381980 (https://phabricator.wikimedia.org/T173493) [14:16:18] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:16:18] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:17:29] (03PS3) 10Jcrespo: raid: Increase retry interval to 10 mins [puppet] - 10https://gerrit.wikimedia.org/r/381981 (https://phabricator.wikimedia.org/T173311) (owner: 10Alexandros Kosiaris) [14:19:10] (03CR) 10Ottomata: [C: 031] ":)" [puppet] - 10https://gerrit.wikimedia.org/r/381980 (https://phabricator.wikimedia.org/T173493) (owner: 10Elukey) [14:20:32] (03CR) 10Jcrespo: "I made it work, but I also made it configurable. If you have the call chain, don't hate me (I am just following the syle guide). 
I may als" [puppet] - 10https://gerrit.wikimedia.org/r/381981 (https://phabricator.wikimedia.org/T173311) (owner: 10Alexandros Kosiaris) [14:20:51] (03CR) 10Jcrespo: "s/have/hate/" [puppet] - 10https://gerrit.wikimedia.org/r/381981 (https://phabricator.wikimedia.org/T173311) (owner: 10Alexandros Kosiaris) [14:22:45] 10Operations, 10monitoring, 10Patch-For-Review: Review check_raid_hpssacli frequency - https://phabricator.wikimedia.org/T173311#3653646 (10jcrespo) I hope you don't hate much my proposal on gerrit (it can be reverted easily). [14:23:23] (03PS1) 10BBlack: cp3009: remove from cluster [puppet] - 10https://gerrit.wikimedia.org/r/381988 (https://phabricator.wikimedia.org/T148422) [14:24:00] (03CR) 10BBlack: [C: 032] cp3009: remove from cluster [puppet] - 10https://gerrit.wikimedia.org/r/381988 (https://phabricator.wikimedia.org/T148422) (owner: 10BBlack) [14:25:38] RECOVERY - IPsec on cp1058 is OK: Strongswan OK - 14 ESP OK [14:26:08] RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 26 ESP OK [14:26:18] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 14 ESP OK [14:26:28] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 114 ESP OK [14:26:28] RECOVERY - IPsec on cp2006 is OK: Strongswan OK - 26 ESP OK [14:26:28] RECOVERY - IPsec on cp1045 is OK: Strongswan OK - 14 ESP OK [14:26:38] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 26 ESP OK [14:26:39] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 114 ESP OK [14:26:48] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 26 ESP OK [14:26:58] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 114 ESP OK [14:27:02] (03CR) 10Alexandros Kosiaris: [C: 031] utils: remove `linter` script [puppet] - 10https://gerrit.wikimedia.org/r/381935 (owner: 10Giuseppe Lavagetto) [14:27:08] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 14 ESP OK [14:27:08] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 114 ESP OK [14:27:09] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 114 ESP OK [14:27:28] RECOVERY 
- IPsec on kafka1020 is OK: Strongswan OK - 114 ESP OK [14:29:23] (03CR) 10Alexandros Kosiaris: "Hm, I tend to have my bundle under vendor/bundle but this will work as well" [puppet] - 10https://gerrit.wikimedia.org/r/381934 (owner: 10Giuseppe Lavagetto) [14:31:22] (03CR) 10Alexandros Kosiaris: [C: 031] utils/git-setup: add installation of the post-commit hook [puppet] - 10https://gerrit.wikimedia.org/r/381936 (owner: 10Giuseppe Lavagetto) [14:41:08] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_wikistats-v2] [14:41:25] poor thorium [14:41:57] did somebody from the Analytics team deploy something horrible to you? Those guys are really shameless :P [14:42:10] jokes aside, checking :) [14:44:03] (03CR) 10Alexandros Kosiaris: [C: 032] raid: Increase retry interval to 10 mins [puppet] - 10https://gerrit.wikimedia.org/r/381981 (https://phabricator.wikimedia.org/T173311) (owner: 10Alexandros Kosiaris) [14:44:22] (03CR) 10Alexandros Kosiaris: [C: 032] "That's quite better than mine actually, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/381981 (https://phabricator.wikimedia.org/T173311) (owner: 10Alexandros Kosiaris) [14:44:26] (03PS4) 10Alexandros Kosiaris: raid: Increase retry interval to 10 mins [puppet] - 10https://gerrit.wikimedia.org/r/381981 (https://phabricator.wikimedia.org/T173311) [14:46:53] 10Operations, 10monitoring, 10Patch-For-Review: Review check_raid_hpssacli frequency - https://phabricator.wikimedia.org/T173311#3653701 (10akosiaris) Nope, looks better than mine. +2ed already. Thanks [14:51:15] 10Operations, 10monitoring, 10Patch-For-Review: Review check_raid_hpssacli frequency - https://phabricator.wikimedia.org/T173311#3653726 (10jcrespo) What about having selected hosts with the previous "high" rate on hiera? is that something that you would be ok with? E.g. active core production mysqls (those... 
[14:52:19] !log mobrovac@tin Started deploy [restbase/deploy@02acc45]: (no justification provided) [14:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:13] 10Operations, 10monitoring, 10Patch-For-Review: Review check_raid_hpssacli frequency - https://phabricator.wikimedia.org/T173311#3653727 (10akosiaris) Sounds fine to me. My only question is "Is this something you intended to do via hosts/ hiera or via role?". I favor the latter but would be ok with the forme... [14:55:19] PROBLEM - Restbase root url on restbase1007 is CRITICAL: connect to address 10.64.0.223 and port 7231: Connection refused [14:56:23] !log upgrading basic packages on cp1065 [14:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:44] !log mobrovac@tin Finished deploy [restbase/deploy@02acc45]: (no justification provided) (duration: 04m 25s) [14:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:19] RECOVERY - Restbase root url on restbase1007 is OK: HTTP OK: HTTP/1.1 200 - 15723 bytes in 0.007 second response time [14:58:29] (03CR) 10Alexandros Kosiaris: [C: 032] check_puppetrun: Execute less often [puppet] - 10https://gerrit.wikimedia.org/r/381982 (https://phabricator.wikimedia.org/T173427) (owner: 10Alexandros Kosiaris) [14:58:34] (03PS2) 10Alexandros Kosiaris: check_puppetrun: Execute less often [puppet] - 10https://gerrit.wikimedia.org/r/381982 (https://phabricator.wikimedia.org/T173427) [14:58:37] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] check_puppetrun: Execute less often [puppet] - 10https://gerrit.wikimedia.org/r/381982 (https://phabricator.wikimedia.org/T173427) (owner: 10Alexandros Kosiaris) [14:59:26] !log mobrovac@tin Started deploy [restbase/deploy@02acc45]: (no justification provided) [14:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:30] 10Operations, 10monitoring, 10Patch-For-Review: Review check_raid_hpssacli frequency 
- https://phabricator.wikimedia.org/T173311#3653756 (10jcrespo) It should be through the role::mariadb::core role key on hiera, but actually I do not think it is possible, because we cannot say on hiera what is the primary m... [14:59:53] (03PS1) 10Elukey: wikistats2: migrate git repo from differencial to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/381995 (https://phabricator.wikimedia.org/T177288) [15:01:19] 10Operations, 10monitoring, 10Patch-For-Review: Review check_puppetrun frequency - https://phabricator.wikimedia.org/T173427#3653762 (10akosiaris) 05Open>03Resolved Changed merged, resolving [15:02:38] PROBLEM - Restbase root url on restbase2002 is CRITICAL: connect to address 10.192.16.153 and port 7231: Connection refused [15:03:29] 10Operations, 10monitoring, 10Patch-For-Review: Review check_raid_hpssacli frequency - https://phabricator.wikimedia.org/T173311#3653765 (10akosiaris) Yeah I was afraid of that, which is why I said I was fine with hosts/ hiera temporarily (for some definition of temporary). Anyway, I am fine with the default... [15:04:39] !log mobrovac@tin Finished deploy [restbase/deploy@02acc45]: (no justification provided) (duration: 05m 14s) [15:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:00] known ^ [15:07:48] PROBLEM - Check systemd state on restbase2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:08:21] (03CR) 10Elukey: [C: 032] wikistats2: migrate git repo from differencial to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/381995 (https://phabricator.wikimedia.org/T177288) (owner: 10Elukey) [15:08:38] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:10:29] (03PS1) 10Filippo Giunchedi: base: cater for rsyslog on trusty running as 'syslog' [puppet] - 10https://gerrit.wikimedia.org/r/381997 (https://phabricator.wikimedia.org/T136312) [15:10:32] (03PS1) 10Filippo Giunchedi: rsyslog: bump maximum TCP sessions [puppet] - 10https://gerrit.wikimedia.org/r/381998 (https://phabricator.wikimedia.org/T136312) [15:10:52] (03CR) 10Filippo Giunchedi: "Not the prettiest fix, though trusty is on its way out" [puppet] - 10https://gerrit.wikimedia.org/r/381997 (https://phabricator.wikimedia.org/T136312) (owner: 10Filippo Giunchedi) [15:12:54] ACKNOWLEDGEMENT - Check systemd state on restbase2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Marko Obrovac cassandra 3 keyspace creation - The acknowledgement expires at: 2017-10-04 17:12:16. [15:12:54] ACKNOWLEDGEMENT - Restbase root url on restbase2002 is CRITICAL: connect to address 10.192.16.153 and port 7231: Connection refused Marko Obrovac cassandra 3 keyspace creation - The acknowledgement expires at: 2017-10-04 17:12:16. 
[15:13:04] PROBLEM - DPKG on cp3048 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:13:54] PROBLEM - DPKG on cp1051 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:13:54] PROBLEM - DPKG on cp1064 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:13:54] PROBLEM - DPKG on cp2006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:13:55] PROBLEM - DPKG on cp1054 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:13:55] PROBLEM - DPKG on cp3010 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:04] PROBLEM - DPKG on cp2007 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:04] PROBLEM - DPKG on cp2018 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:04] PROBLEM - DPKG on cp1072 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:04] PROBLEM - DPKG on cp4009 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:04] PROBLEM - DPKG on cp3031 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:04] PROBLEM - DPKG on cp3045 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:04] PROBLEM - DPKG on cp4025 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:05] PROBLEM - DPKG on cp4028 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:05] PROBLEM - DPKG on cp2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:06] that's me, looking, but I think it's false [15:14:06] PROBLEM - DPKG on cp1066 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:06] PROBLEM - DPKG on cp3032 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:07] PROBLEM - DPKG on cp3036 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:24] PROBLEM - DPKG on cp1067 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:24] PROBLEM - DPKG on cp1055 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:24] PROBLEM - DPKG on cp1062 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:24] 
PROBLEM - DPKG on cp2013 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:25] PROBLEM - DPKG on cp1058 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:25] PROBLEM - DPKG on cp1061 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:25] PROBLEM - DPKG on cp1099 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:34] PROBLEM - DPKG on cp2005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:34] PROBLEM - DPKG on cp2016 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:34] PROBLEM - DPKG on cp1065 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:34] PROBLEM - DPKG on cp1063 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:34] PROBLEM - DPKG on cp1074 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:35] PROBLEM - DPKG on cp4017 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:35] PROBLEM - DPKG on cp4022 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:35] PROBLEM - DPKG on cp2017 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:38] PROBLEM - DPKG on cp4018 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:38] PROBLEM - DPKG on cp2025 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:38] PROBLEM - DPKG on cp1045 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:38] PROBLEM - DPKG on cp4027 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:42] apparently "broken package" includes any set-selection temporary holds :P [15:14:44] PROBLEM - DPKG on cp3007 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:44] PROBLEM - DPKG on cp3008 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:44] PROBLEM - DPKG on cp3037 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:44] PROBLEM - DPKG on cp3046 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:44] PROBLEM - DPKG on cp1068 is CRITICAL: DPKG CRITICAL dpkg reports broken 
packages [15:14:44] PROBLEM - DPKG on cp2010 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:44] PROBLEM - DPKG on cp2004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:14:45] PROBLEM - DPKG on cp2008 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:15:04] PROBLEM - DPKG on cp4026 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:15:04] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[nginx-common] [15:15:34] PROBLEM - puppet last run on cp4027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[nginx-common] [15:16:03] undoing for now, as apparently it breaks puppet's require_package too :P [15:16:04] PROBLEM - puppet last run on cp1048 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[nginx-common] [15:16:24] PROBLEM - DPKG on cp2012 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:16:24] PROBLEM - DPKG on cp4021 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:16:24] RECOVERY - DPKG on cp1067 is OK: All packages OK [15:16:24] RECOVERY - DPKG on cp1055 is OK: All packages OK [15:16:24] RECOVERY - DPKG on cp1062 is OK: All packages OK [15:16:24] RECOVERY - DPKG on cp2013 is OK: All packages OK [15:16:24] RECOVERY - DPKG on cp1058 is OK: All packages OK [15:16:25] RECOVERY - DPKG on cp1061 is OK: All packages OK [15:16:25] RECOVERY - DPKG on cp1099 is OK: All packages OK [15:16:34] RECOVERY - DPKG on cp2005 is OK: All packages OK [15:16:36] RECOVERY - DPKG on cp2016 is OK: All packages OK [15:16:36] RECOVERY - DPKG on cp1065 is OK: All packages OK [15:16:36] RECOVERY - DPKG on cp1063 is OK: All packages OK [15:16:36] RECOVERY - DPKG on cp4017 is OK: All packages OK [15:16:36] RECOVERY - DPKG on cp4022 is OK: All packages OK 
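A minimal sketch of the held-versus-broken distinction raised above, assuming the check reads the two-letter status column of `dpkg -l` output. The real check_dpkg implementation is not shown in this log, so the function names `classify` and `status` and the sample rows are hypothetical. It treats a held but fully installed package (`hi`) as a warning and any other non-`ii` state as critical, which is stricter than some checks.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def classify(dpkg_l_lines):
    """Split `dpkg -l` style rows into (held, broken) package name lists."""
    held, broken = [], []
    for line in dpkg_l_lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        flags, name = fields[0], fields[1]
        # Column 1 is "<desired><status>[<error>]"; header and separator
        # rows do not match these character sets and are skipped.
        if len(flags) not in (2, 3):
            continue
        if flags[0] not in "uihrp" or flags[1] not in "ncHUFWti":
            continue
        state = flags[:2]
        if state == "ii":       # install desired, installed: healthy
            continue
        if state == "hi":       # hold desired, installed: held, not broken
            held.append(name)
        else:                   # assumption: everything else counts as broken
            broken.append(name)
    return held, broken


def status(held, broken):
    """Map the two lists to a Nagios-style exit code."""
    if broken:
        return CRITICAL
    if held:
        return WARNING
    return OK


if __name__ == "__main__":
    sample = [
        "||/ Name Version Architecture Description",
        "ii  bash 4.3-11 amd64 GNU Bourne Again SHell",
        "hi  nginx-common 1.13.5 all small, powerful web server (common files)",
    ]
    held, broken = classify(sample)
    print(status(held, broken), held, broken)
```

With only a temporary hold in place, this reports WARNING instead of CRITICAL, which matches the intent of the later "warn (not crit) if packages are held" patch title.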
[15:16:36] RECOVERY - DPKG on cp1074 is OK: All packages OK [15:16:36] RECOVERY - DPKG on cp2017 is OK: All packages OK [15:16:36] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[nginx-common] [15:16:36] RECOVERY - DPKG on cp2025 is OK: All packages OK [15:16:37] RECOVERY - DPKG on cp4018 is OK: All packages OK [15:16:37] RECOVERY - DPKG on cp1045 is OK: All packages OK [15:16:38] RECOVERY - DPKG on cp4027 is OK: All packages OK [15:16:44] what a mess [15:16:44] RECOVERY - DPKG on cp1068 is OK: All packages OK [15:16:44] RECOVERY - DPKG on cp3007 is OK: All packages OK [15:16:44] RECOVERY - DPKG on cp3008 is OK: All packages OK [15:16:44] RECOVERY - DPKG on cp3037 is OK: All packages OK [15:16:44] RECOVERY - DPKG on cp3046 is OK: All packages OK [15:16:44] RECOVERY - DPKG on cp2010 is OK: All packages OK [15:16:45] RECOVERY - DPKG on cp2004 is OK: All packages OK [15:17:04] RECOVERY - DPKG on cp3010 is OK: All packages OK [15:17:04] RECOVERY - DPKG on cp2007 is OK: All packages OK [15:17:04] RECOVERY - DPKG on cp2018 is OK: All packages OK [15:17:04] RECOVERY - DPKG on cp1072 is OK: All packages OK [15:17:04] RECOVERY - DPKG on cp2001 is OK: All packages OK [15:17:04] RECOVERY - DPKG on cp4009 is OK: All packages OK [15:17:05] RECOVERY - DPKG on cp4025 is OK: All packages OK [15:17:05] RECOVERY - DPKG on cp4026 is OK: All packages OK [15:17:06] RECOVERY - DPKG on cp4028 is OK: All packages OK [15:17:06] RECOVERY - DPKG on cp3031 is OK: All packages OK [15:17:07] RECOVERY - DPKG on cp3045 is OK: All packages OK [15:17:07] RECOVERY - DPKG on cp1066 is OK: All packages OK [15:17:08] PROBLEM - puppet last run on cp2007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[nginx-common] [15:17:08] RECOVERY - DPKG on cp1048 is OK: All packages OK [15:17:24] RECOVERY - DPKG on cp2012 is OK: All packages OK [15:17:26] RECOVERY - DPKG on cp4021 is OK: All packages OK [15:17:54] PROBLEM - puppet last run on cp3045 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[nginx-common] [15:17:54] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[nginx-common] [15:18:05] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[nginx-common] [15:18:44] PROBLEM - puppet last run on cp2018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[nginx-common] [15:18:55] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[nginx-common] [15:18:55] RECOVERY - Check systemd state on restbase2002 is OK: OK - running: The system is fully operational [15:19:05] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[nginx-common] [15:20:05] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:20:45] RECOVERY - Restbase root url on restbase2002 is OK: HTTP OK: HTTP/1.1 200 - 15723 bytes in 0.085 second response time [15:20:54] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [15:21:02] 10Operations, 10monitoring, 10Patch-For-Review: Review check_raid_hpssacli frequency - https://phabricator.wikimedia.org/T173311#3653823 (10faidon) What is the high-level/human description of the policy we want to enforce for database servers? (e.g "HP servers in the active datacenter need the check to run e... [15:21:14] RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:21:14] RECOVERY - puppet last run on cp2007 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:21:14] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [15:21:14] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:21:44] RECOVERY - puppet last run on cp4027 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [15:21:44] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:21:45] RECOVERY - puppet last run on cp2018 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [15:21:55] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [15:22:04] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:23:31] !log mobrovac@tin Started deploy [restbase/deploy@02acc45]: (no 
justification provided) [15:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:48] <_joe_> !log uploading blubber to jessie-wikimedia T175296 [15:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:53] T175296: Install Blubber on contint1001 - https://phabricator.wikimedia.org/T175296 [15:28:48] (03PS1) 10Giuseppe Lavagetto: profile::ci::slave: install blubber [puppet] - 10https://gerrit.wikimedia.org/r/382004 (https://phabricator.wikimedia.org/T175296) [15:29:58] !log mobrovac@tin Finished deploy [restbase/deploy@02acc45]: (no justification provided) (duration: 06m 27s) [15:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:13] 10Operations, 10Cloud-Services, 10User-bd808: Investigate ceasing self-service new Trusty instance creation in Labs - https://phabricator.wikimedia.org/T161899#3653856 (10chasemp) a:03bd808 [15:30:29] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/381998 (https://phabricator.wikimedia.org/T136312) (owner: 10Filippo Giunchedi) [15:33:34] (03CR) 10Volans: [C: 031] "LGTM, if possible try to get Moritz advice on the export puppet cert to the syslog user." 
[puppet] - 10https://gerrit.wikimedia.org/r/381997 (https://phabricator.wikimedia.org/T136312) (owner: 10Filippo Giunchedi) [15:35:18] (03PS2) 10Filippo Giunchedi: base: cater for rsyslog on trusty running as 'syslog' [puppet] - 10https://gerrit.wikimedia.org/r/381997 (https://phabricator.wikimedia.org/T136312) [15:36:35] (03CR) 10Filippo Giunchedi: [C: 032] base: cater for rsyslog on trusty running as 'syslog' [puppet] - 10https://gerrit.wikimedia.org/r/381997 (https://phabricator.wikimedia.org/T136312) (owner: 10Filippo Giunchedi) [15:36:45] (03CR) 10Filippo Giunchedi: [C: 032] rsyslog: bump maximum TCP sessions [puppet] - 10https://gerrit.wikimedia.org/r/381998 (https://phabricator.wikimedia.org/T136312) (owner: 10Filippo Giunchedi) [15:36:59] (03PS2) 10Filippo Giunchedi: rsyslog: bump maximum TCP sessions [puppet] - 10https://gerrit.wikimedia.org/r/381998 (https://phabricator.wikimedia.org/T136312) [15:39:21] we rock: expose_puppet_cert has user not owner as a parameter [15:39:27] (03PS1) 10BBlack: check_dpkg: warn (not crit) if packages are held [puppet] - 10https://gerrit.wikimedia.org/r/382010 [15:39:48] and then of course owner => $user, [15:40:20] ops... I didn't double check, sorry [15:40:41] me neither, rookie mistake to assume consistency [15:41:14] (03PS1) 10Filippo Giunchedi: base: fix expose_puppet_certs owner vs user [puppet] - 10https://gerrit.wikimedia.org/r/382011 [15:41:44] PROBLEM - puppet last run on mw2193 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:42:07] there will be some puppet failures, sorry about that [15:42:12] (03CR) 10Filippo Giunchedi: [C: 032] base: fix expose_puppet_certs owner vs user [puppet] - 10https://gerrit.wikimedia.org/r/382011 (owner: 10Filippo Giunchedi) [15:42:47] PROBLEM - puppet last run on elastic2034 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:42:58] !log upgrading basic packages on role::cache::misc [15:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:27] PROBLEM - puppet last run on cp4026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:37] PROBLEM - puppet last run on wtp2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:46] PROBLEM - puppet last run on db2054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:46] PROBLEM - puppet last run on restbase2009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:57] PROBLEM - puppet last run on ms-be2023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:16] PROBLEM - puppet last run on es2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:16] PROBLEM - puppet last run on mw2218 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:26] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:26] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:27] PROBLEM - puppet last run on db2087 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:27] PROBLEM - puppet last run on db2089 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:27] PROBLEM - puppet last run on es2017 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:44:27] PROBLEM - puppet last run on maps-test2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:27] PROBLEM - puppet last run on restbase-test2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:36] PROBLEM - puppet last run on mw2183 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:36] PROBLEM - puppet last run on cp4021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:36] PROBLEM - puppet last run on cp2023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:37] PROBLEM - puppet last run on mw2198 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:37] PROBLEM - puppet last run on ms-be2036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:37] PROBLEM - puppet last run on es2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:37] PROBLEM - puppet last run on mw2143 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:37] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:37] godog: feel free to run https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed ;) [15:44:38] PROBLEM - puppet last run on db2086 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:46] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:46] PROBLEM - puppet last run on ms-be2024 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:44:46] PROBLEM - puppet last run on cp2020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:46] PROBLEM - puppet last run on mw2119 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:46] PROBLEM - puppet last run on mw2226 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:47] PROBLEM - puppet last run on cp2010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:47] PROBLEM - puppet last run on ms-be2028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:47] PROBLEM - puppet last run on mw2228 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:48] PROBLEM - puppet last run on ms-be2027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:48] PROBLEM - puppet last run on db2070 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:49] PROBLEM - puppet last run on labtestvirt2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:49] PROBLEM - puppet last run on pybal-test2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:50] PROBLEM - puppet last run on mw2154 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:56] PROBLEM - puppet last run on es2015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:58] PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:44:58] PROBLEM - puppet last run on ms-be2038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:58] PROBLEM - puppet last run on ms-be2032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:06] PROBLEM - puppet last run on labtestservices2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:06] PROBLEM - puppet last run on elastic2036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:06] PROBLEM - puppet last run on elastic2035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:06] PROBLEM - puppet last run on mw2144 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:06] PROBLEM - puppet last run on db2035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:06] PROBLEM - puppet last run on mw2123 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:06] PROBLEM - puppet last run on mw2216 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:07] PROBLEM - puppet last run on mw2221 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:09] PROBLEM - puppet last run on scb2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:09] PROBLEM - puppet last run on labstore2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:10] PROBLEM - puppet last run on mw2247 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:45:10] PROBLEM - puppet last run on ms-be2029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:12] volans: yup, running now [15:45:16] PROBLEM - puppet last run on restbase2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:16] PROBLEM - puppet last run on mw2176 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:16] PROBLEM - puppet last run on ms-fe2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:16] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:57] PROBLEM - puppet last run on cp4027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:46:06] PROBLEM - puppet last run on ms-fe2007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:46:16] PROBLEM - puppet last run on thumbor2004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:46:46] RECOVERY - puppet last run on mw2193 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [15:47:50] standby for recovery shower [15:47:56] RECOVERY - puppet last run on elastic2034 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [15:48:01] wee [15:48:36] RECOVERY - puppet last run on wtp2016 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:48:37] RECOVERY - puppet last run on db2054 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:48:46] RECOVERY - puppet last run on restbase2009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:48:56] RECOVERY - puppet last run on ms-be2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:49:16] PROBLEM - puppet last run on mw1164 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:49:16] PROBLEM - puppet last run on mw1320 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:49:16] RECOVERY - puppet last run on es2004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:49:16] RECOVERY - puppet last run on mw2218 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:49:17] PROBLEM - puppet last run on elastic1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:49:18] * volans standing by [15:49:26] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:49:26] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:49:26] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:49:27] RECOVERY - puppet last run on db2087 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:49:27] RECOVERY - puppet last run on db2089 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:49:27] RECOVERY - puppet last run on es2017 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:49:27] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:49:27] RECOVERY - puppet last run on maps-test2004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:49:28] RECOVERY - puppet last run on restbase-test2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:49:36] RECOVERY - puppet last run on cp2023 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:49:37] RECOVERY - puppet last run on mw2183 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:49:37] RECOVERY - puppet last run on mw2198 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:49:37] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:49:37] PROBLEM - puppet last run on mc1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:49:37] PROBLEM - puppet last run on restbase1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:49:37] RECOVERY - puppet last run on ms-be2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:49:37] PROBLEM - puppet last run on kubestagetcd1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:49:38] RECOVERY - puppet last run on es2016 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:49:38] RECOVERY - puppet last run on mw2143 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:49:39] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:49:39] RECOVERY - puppet last run on db2086 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:49:46] PROBLEM - puppet last run on mw1203 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:49:46] RECOVERY - puppet last run on cp2020 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:49:46] RECOVERY - puppet last run on ms-be2024 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:49:46] RECOVERY - puppet last run on cp2010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:49:47] RECOVERY - puppet last run on mw2226 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [15:49:47] RECOVERY - puppet last run on mw2119 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:49:47] RECOVERY - puppet last run on ms-be2028 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:49:48] RECOVERY - puppet last run on db2070 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:49:48] RECOVERY - puppet last run on mw2228 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:49:49] RECOVERY - puppet last run on ms-be2027 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:50:06] RECOVERY - puppet last run on labtestservices2003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures 
[15:50:06] RECOVERY - puppet last run on elastic2036 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:50:06] RECOVERY - puppet last run on elastic2035 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:50:06] RECOVERY - puppet last run on mw2144 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:50:06] RECOVERY - puppet last run on db2035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:50:07] RECOVERY - puppet last run on mw2123 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:50:07] RECOVERY - puppet last run on mw2216 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:50:08] RECOVERY - puppet last run on mw2221 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:50:10] RECOVERY - puppet last run on scb2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:50:11] RECOVERY - puppet last run on mw2247 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:50:11] RECOVERY - puppet last run on labstore2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:50:12] RECOVERY - puppet last run on ms-be2029 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:50:16] RECOVERY - puppet last run on restbase2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:50:16] RECOVERY - puppet last run on mw2176 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:50:16] RECOVERY - puppet last run on ms-fe2005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:50:16] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:50:44] !log upgrading basic packages on role::cache::upload 
[15:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:58] (03CR) 10Dzahn: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/381903 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn) [15:51:06] RECOVERY - puppet last run on ms-fe2007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:51:16] RECOVERY - puppet last run on thumbor2004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:54:16] RECOVERY - puppet last run on mw1320 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:54:17] RECOVERY - puppet last run on elastic1033 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:54:27] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:54:36] RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:54:36] RECOVERY - puppet last run on cp4021 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [15:54:37] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:54:37] RECOVERY - puppet last run on mc1021 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:54:37] RECOVERY - puppet last run on restbase1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:54:37] RECOVERY - puppet last run on kubestagetcd1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:54:46] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:54:46] RECOVERY - puppet last run on mw1203 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [15:56:06] RECOVERY - puppet last run on cp4027 is OK: OK: Puppet is currently 
enabled, last run 2 minutes ago with 0 failures [15:57:53] !log upgrading basic packages on role::cache::text [15:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:26] RECOVERY - puppet last run on cp4026 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [15:59:16] RECOVERY - puppet last run on mw1164 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:59:26] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[prometheus-node-exporter] [16:00:05] godog, moritzm, and _joe_: #bothumor I � Unicode. All rise for Puppet SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171003T1600). [16:00:05] Krinkle and Amir1: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:13] o/ [16:00:52] cp2024 just a race condition [16:01:12] <_joe_> Amir1: I have a meeting right now, let me see if godog is available [16:01:18] I am yeah [16:01:24] I'll take a look [16:01:33] Thanks [16:01:36] o/ [16:02:30] Krinkle: I'll start with yours [16:02:48] godog: OK. Could you apply it to canary servers first? [16:02:59] <_joe_> godog: I don't think that patch should go via puppetswat [16:02:59] In particular just wanna double check on XWD just to be safe. [16:04:26] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:04:35] _joe_: I believe we were in this position last week. 
[16:04:38] (03PS2) 10Dzahn: Add am.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/378395 (https://phabricator.wikimedia.org/T176042) (owner: 10Ladsgroup) [16:04:56] It's a one line config change to HHVM that makes it match the config value MediaWiki unconditionally sets on every request already. [16:05:01] Tested in beta for a week as well. [16:05:48] mutante: thanks (if you are looking at those patches that is) [16:05:56] Or is the issue that we want the restarts to be spaced out? E.g. would take too long for swat? [16:06:25] I would imagine puppet doing that already by default. [16:06:32] so according to the guidelines in https://wikitech.wikimedia.org/wiki/PuppetSWAT it is a bit invasive alright [16:07:18] (03CR) 10Dzahn: [C: 032] Add am.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/378395 (https://phabricator.wikimedia.org/T176042) (owner: 10Ladsgroup) [16:07:19] i'll take the DNS change [16:07:23] not sure about Apache yet [16:08:30] ah, easy enough, ok [16:08:39] godog: author consent (Tim said it's fine to roll out any time), 2 +1's, tested on beta cluster, not an apache config change, trivial complexity in the patch itself. [16:08:47] (03PS2) 10Dzahn: Apache config for am.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/378396 (https://phabricator.wikimedia.org/T176042) (owner: 10Ladsgroup) [16:08:52] Anyway, I'm okay waiting, it's very low priority patch, hence in SWAT. [16:09:15] (03PS1) 10Gehel: osm: ensure we do have a maxInterval on OSM replication [puppet] - 10https://gerrit.wikimedia.org/r/382016 [16:11:00] just arrived at train station. no worries i will finish the Apache thing in half an hour or so [16:11:08] am.wm.org that is. bbiaw [16:11:16] Krinkle: ack, thanks, yeah it'd be ok to schedule some time and deploy it [16:11:58] godog: I thought we did that last week, for today :) - Anyway, so, who from ops and when works best? 
[16:12:33] eddiegp: re: https://gerrit.wikimedia.org/r/#/c/381050 please seek consensus with the traffic team [16:14:19] godog: Have them review the patch and schedule it for another puppet swat afterwards or ask them to deploy it themselves? [16:14:34] Krinkle: heh, moritzm and elukey would probably be a good start [16:15:02] It's my first puppet patch, so I wasn't really sure where to go ;) [16:15:20] eddiegp: np, preferably the latter [16:15:54] Okay, I'll ask about it in #wikimedia-traffic then. [16:16:20] (03CR) 10Filippo Giunchedi: "LGTM, just a nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/382010 (owner: 10BBlack) [16:17:24] thanks! [16:19:00] (03PS1) 10Gehel: maps: add a scap dsh group for maps servers [puppet] - 10https://gerrit.wikimedia.org/r/382017 [16:23:38] (03PS2) 10Gehel: maps: add a scap dsh group for maps servers [puppet] - 10https://gerrit.wikimedia.org/r/382017 [16:36:31] 10Operations, 10Operations-Software-Development, 10Continuous-Integration-Infrastructure (shipyard): New tool to track package updates/status for hosts and images (debmonitor) - https://phabricator.wikimedia.org/T167504#3653980 (10hashar) [16:36:38] 10Operations, 10Cloud-VPS, 10User-bd808, 10cloud-services-team (Kanban): End self-service new Trusty instance creation in Cloud VPS; standardize on Debian base images - https://phabricator.wikimedia.org/T161899#3653981 (10bd808) [16:44:06] 10Operations, 10monitoring, 10Patch-For-Review: Review check_raid_hpssacli frequency - https://phabricator.wikimedia.org/T173311#3653993 (10jcrespo) > What is the high-level/human description of the policy we want to enforce for database servers? "replicas that are on a critical path of lagging if they have... [16:44:11] 10Operations, 10Cloud-VPS, 10User-bd808, 10cloud-services-team (Kanban): End self-service new Trusty instance creation in Cloud VPS; standardize on Debian base images - https://phabricator.wikimedia.org/T161899#3653994 (10bd808) Retitled the task yet again. 
The Foundation production networks are getting cl... [16:51:11] 10Operations, 10monitoring, 10Patch-For-Review: Review check_raid_hpssacli frequency - https://phabricator.wikimedia.org/T173311#3654031 (10faidon) Would it make sense to lower the interval for all `role::mariadb::core`, irrespective of mysql_role, to make this a simpler target? We can take the extra hit of... [16:56:45] 10Operations, 10monitoring, 10Patch-For-Review: Review check_raid_hpssacli frequency - https://phabricator.wikimedia.org/T173311#3654041 (10jcrespo) It could work, it was my initial reaction, but I do not have any strong arguments for it (but I on purpose left it parametrizable as something to check back in... [16:57:40] PROBLEM - HHVM jobrunner on mw1303 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [16:58:40] RECOVERY - HHVM jobrunner on mw1303 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [16:58:56] 10Operations, 10cloud-services-team: Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3654046 (10bd808) [17:00:05] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear deployers, time to do the Services – Graphoid / Parsoid / OCG / Citoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171003T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. [17:00:28] Nothing for ORES today. [17:00:32] But soon. 
SOON [17:00:32] (03PS2) 10BBlack: check_dpkg: warn (not crit) if packages are held [puppet] - 10https://gerrit.wikimedia.org/r/382010 [17:00:56] (03CR) 10BBlack: check_dpkg: warn (not crit) if packages are held (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/382010 (owner: 10BBlack) [17:11:22] (03PS2) 10BBlack: varnish: remove references to mfLazyLoadReferences [puppet] - 10https://gerrit.wikimedia.org/r/381050 (https://phabricator.wikimedia.org/T175381) (owner: 10EddieGP) [17:12:18] (03CR) 10BBlack: [C: 032] varnish: remove references to mfLazyLoadReferences [puppet] - 10https://gerrit.wikimedia.org/r/381050 (https://phabricator.wikimedia.org/T175381) (owner: 10EddieGP) [17:25:22] 10Operations, 10Puppet, 10User-Joe: Prepare for Puppet 4 - https://phabricator.wikimedia.org/T169548#3654132 (10bd808) [17:25:24] 10Operations, 10cloud-services-team: Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3654131 (10bd808) [17:27:30] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:28:05] hmm mutante ^^ [17:28:17] * paladox wonders how did it recover? 
[17:29:35] paladox: https://media3.giphy.com/media/12NUbkX6p4xOO4/giphy.gif [17:29:45] lol [17:30:18] 10Puppet, 10Toolforge, 10cloud-services-team (Kanban): Switch Toolforge project hosts to the future parser - https://phabricator.wikimedia.org/T177298#3654146 (10bd808) [17:31:35] (03PS3) 10Dzahn: Apache config for am.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/378396 (https://phabricator.wikimedia.org/T176042) (owner: 10Ladsgroup) [17:32:00] paladox: it wasn't me [17:32:07] ok [17:33:48] 10Puppet, 10Toolforge, 10cloud-services-team (Kanban): Switch Toolforge project hosts to the future parser - https://phabricator.wikimedia.org/T177298#3654171 (10bd808) [17:35:43] (03CR) 10Dzahn: [C: 032] Apache config for am.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/378396 (https://phabricator.wikimedia.org/T176042) (owner: 10Ladsgroup) [17:36:54] (03PS3) 10BBlack: check_dpkg: warn (not crit) if packages are held [puppet] - 10https://gerrit.wikimedia.org/r/382010 [17:36:58] (03PS4) 10BBlack: check_dpkg: warn (not crit) if packages are held [puppet] - 10https://gerrit.wikimedia.org/r/382010 [17:43:24] (03CR) 10BBlack: [C: 032] check_dpkg: warn (not crit) if packages are held [puppet] - 10https://gerrit.wikimedia.org/r/382010 (owner: 10BBlack) [17:46:36] (03CR) 10Thcipriani: [C: 031] maps: add a scap dsh group for maps servers [puppet] - 10https://gerrit.wikimedia.org/r/382017 (owner: 10Gehel) [17:49:04] (03PS1) 10EddieGP: wikitech: Align 'contentadmin' and 'sysop' permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382024 (https://phabricator.wikimedia.org/T171208) [17:55:29] 10Operations: Upgrade puppetDB to version 3.2 or newer - https://phabricator.wikimedia.org/T177253#3654348 (10herron) [17:56:44] 10Operations, 10cloud-services-team: Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3654352 (10herron) [17:56:47] 10Operations: Upgrade puppetDB to version 3.2 or newer - 
https://phabricator.wikimedia.org/T177253#3652246 (10herron) [17:57:48] (03Abandoned) 10ArielGlenn: add facter fact to pull out nginx version [puppet/nginx] - 10https://gerrit.wikimedia.org/r/274683 (owner: 10ArielGlenn) [17:58:35] (03PS1) 10Dzahn: screen-monitor: whitelist deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/382026 (https://phabricator.wikimedia.org/T165348) [17:59:17] (03CR) 10Dzahn: [C: 032] screen-monitor: whitelist deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/382026 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [18:02:27] (03PS1) 10Rush: labweb: move to internal VLAN [dns] - 10https://gerrit.wikimedia.org/r/382027 (https://phabricator.wikimedia.org/T167820) [18:03:58] (03PS1) 10Chad: group0 to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382028 [18:06:39] (03PS2) 10Rush: labweb: move to internal VLAN [dns] - 10https://gerrit.wikimedia.org/r/382027 (https://phabricator.wikimedia.org/T167820) [18:08:18] (03PS1) 10Chad: scap prep: default to 4 jobs for fetching submodules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382029 [18:10:24] 10Operations, 10Puppet, 10Patch-For-Review, 10Readers-Web-Backlog (Tracking): Remove references to non-existent mfLazyLoadReferences cookies - https://phabricator.wikimedia.org/T175381#3654415 (10Jdlrobson) 05Open>03Resolved a:03Jdlrobson Thanks @bblack! [18:10:31] (03CR) 10Hoo man: [C: 031] "Fine to merge at any time." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/381642 (https://phabricator.wikimedia.org/T129475) (owner: 10Ladsgroup) [18:11:09] (03PS2) 10Dzahn: add releases-jenkins apache/varnish, move jenkins proxy config [puppet] - 10https://gerrit.wikimedia.org/r/381907 (https://phabricator.wikimedia.org/T164030) [18:11:27] (03PS1) 10Andrew Bogott: labweb1001, 1002: move to internal IP and Debian Stretch [puppet] - 10https://gerrit.wikimedia.org/r/382030 [18:12:21] (03CR) 10Thcipriani: [C: 031] "nice :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382029 (owner: 10Chad) [18:12:54] (03PS3) 10Dzahn: add releases-jenkins apache/varnish, move jenkins proxy config [puppet] - 10https://gerrit.wikimedia.org/r/381907 (https://phabricator.wikimedia.org/T164030) [18:12:56] (03CR) 10Andrew Bogott: [C: 032] labweb1001, 1002: move to internal IP and Debian Stretch [puppet] - 10https://gerrit.wikimedia.org/r/382030 (owner: 10Andrew Bogott) [18:14:24] mutante: when you get a sec could you possibly take a look at https://gerrit.wikimedia.org/r/#/c/382027/ [18:14:48] 10Operations, 10cloud-services-team: Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3654441 (10herron) [18:16:11] (03CR) 10Dzahn: [C: 032] add releases-jenkins apache/varnish, move jenkins proxy config [puppet] - 10https://gerrit.wikimedia.org/r/381907 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn) [18:16:34] (03PS4) 10Dzahn: add releases-jenkins apache/varnish, move jenkins proxy config [puppet] - 10https://gerrit.wikimedia.org/r/381907 (https://phabricator.wikimedia.org/T164030) [18:16:39] 10Operations: Upgrade puppetDB to version 3.2 or newer - https://phabricator.wikimedia.org/T177253#3654444 (10herron) [18:16:57] (03CR) 10Zoranzoki21: [C: 031] group0 to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382028 (owner: 10Chad) [18:22:17] (03CR) 10Dzahn: [C: 031] "lgtm, racks B and D match what is in racktables and the networks used" [dns] - 
10https://gerrit.wikimedia.org/r/382027 (https://phabricator.wikimedia.org/T167820) (owner: 10Rush) [18:22:38] chasemp: looks correct to me, i checked racktables and compared the network [18:22:52] you will need switch config though [18:22:53] mutante: thank you much [18:22:59] mutante: yep, on it [18:23:02] ok :) [18:23:57] (03CR) 10Rush: [C: 032] labweb: move to internal VLAN [dns] - 10https://gerrit.wikimedia.org/r/382027 (https://phabricator.wikimedia.org/T167820) (owner: 10Rush) [18:24:27] !log auth-dnsupdate on bahan for 382027 [18:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:51] !log change to private vlan for labweb1002 on asw2-d-eqiad [18:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:03] !log change to private vlan for labweb1001 on asw-b-eqiad [18:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:26] (03PS2) 10Dzahn: screen-monitor: whitelist deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/382026 (https://phabricator.wikimedia.org/T165348) [18:29:04] (03PS3) 10Dzahn: add releases-jenkins to misc-web cluster [dns] - 10https://gerrit.wikimedia.org/r/381903 (https://phabricator.wikimedia.org/T164030) [18:32:58] (03PS2) 10Chad: scap prep: default to 4 jobs for fetching submodules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382029 [18:33:48] (03CR) 10Thcipriani: [C: 031] scap prep: default to 4 jobs for fetching submodules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382029 (owner: 10Chad) [18:35:54] 10Operations, 10ops-eqiad, 10DC-Ops: Multiple servers in eqiad D8 showing PSU failures - https://phabricator.wikimedia.org/T177227#3654544 (10Cmjohnson) 05Open>03Resolved The breaker had tripped for 1 of the phases on side A. 
Reset [18:36:37] (03CR) 10Dzahn: [C: 032] add releases-jenkins to misc-web cluster [dns] - 10https://gerrit.wikimedia.org/r/381903 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn) [18:38:31] RECOVERY - IPMI Sensor Status on analytics1002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [18:38:31] RECOVERY - IPMI Sensor Status on kafka1020 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [18:38:31] RECOVERY - IPMI Sensor Status on kafka1018 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [18:38:31] RECOVERY - IPMI Sensor Status on conf1003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [18:40:18] (03CR) 10Chad: [C: 032] scap prep: default to 4 jobs for fetching submodules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382029 (owner: 10Chad) [18:43:52] !log demon@tin Started scap: bootstrap wmf.2 [18:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:05] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1056 - https://phabricator.wikimedia.org/T177171#3654609 (10Cmjohnson) Disk replaced and rebuilding Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spu... 
[18:45:35] (03PS1) 10Andrew Bogott: labweb: restore mgmt IP entries, move internal IPs elsewhere [dns] - 10https://gerrit.wikimedia.org/r/382031 [18:45:55] (03CR) 10Zoranzoki21: [C: 031] scap prep: default to 4 jobs for fetching submodules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382029 (owner: 10Chad) [18:46:29] (03PS3) 10Gehel: maps: add a scap dsh group for maps servers [puppet] - 10https://gerrit.wikimedia.org/r/382017 [18:46:45] (03Merged) 10jenkins-bot: scap prep: default to 4 jobs for fetching submodules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382029 (owner: 10Chad) [18:47:00] (03CR) 10jenkins-bot: scap prep: default to 4 jobs for fetching submodules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382029 (owner: 10Chad) [18:47:10] (03CR) 10Gehel: [C: 032] maps: add a scap dsh group for maps servers [puppet] - 10https://gerrit.wikimedia.org/r/382017 (owner: 10Gehel) [18:47:25] (03CR) 10Pnorman: [C: 031] osm: ensure we do have a maxInterval on OSM replication [puppet] - 10https://gerrit.wikimedia.org/r/382016 (owner: 10Gehel) [18:49:30] PROBLEM - HHVM rendering on mw2213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:20] RECOVERY - HHVM rendering on mw2213 is OK: HTTP OK: HTTP/1.1 200 OK - 75588 bytes in 0.291 second response time [18:56:06] (03CR) 10Andrew Bogott: [C: 032] labweb: restore mgmt IP entries, move internal IPs elsewhere [dns] - 10https://gerrit.wikimedia.org/r/382031 (owner: 10Andrew Bogott) [19:00:04] twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171003T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. 
[19:01:09] !log Starting MediaWiki train for 1.30.1-wmf.2 [19:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:47] 10Operations, 10ops-eqiad, 10DC-Ops: Multiple servers in eqiad D8 showing PSU failures - https://phabricator.wikimedia.org/T177227#3654750 (10herron) Thanks @Cmjohnson! This cleared 4 of the open alerts. Oddly there are 3 hosts in this rack with open power supply alerts still. Is there any indication of a... [19:08:22] 10Operations, 10ops-eqiad, 10DC-Ops: Multiple servers in eqiad D8 showing PSU failures - https://phabricator.wikimedia.org/T177227#3654768 (10herron) 05Resolved>03Open [19:17:42] !log demon@tin Finished scap: bootstrap wmf.2 (duration: 33m 50s) [19:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:18] no_justification: you're doing wmf.2? I thought I was taking it for you [19:20:21] hahah and I got the v. wrong [19:33:14] !log demon@tin Synchronized scap/plugins/prep.py: no-op (duration: 00m 51s) [19:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:57] (03CR) 10HaeB: "Thanks Mforns - to your points:" [puppet] - 10https://gerrit.wikimedia.org/r/379829 (https://phabricator.wikimedia.org/T175395) (owner: 10Bmansurov) [19:50:56] (03CR) 10Chad: [C: 032] group0 to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382028 (owner: 10Chad) [19:54:58] (03Merged) 10jenkins-bot: group0 to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382028 (owner: 10Chad) [19:56:20] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3654983 (10Dzahn) There are no more alerts now. All the screens/tmux on puppetmaster are closed now. Deployment servers and rhenium are whitelisted. So right now Icinga is clean. 
[19:56:21] (03CR) 10jenkins-bot: group0 to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382028 (owner: 10Chad) [19:58:34] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to wmf.2 [19:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:39] 10Operations, 10RelEng-Archive-FY201718-Q1, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10Security-General: setup releases1001.eqiad.wmnet (was: setup mwreleases1001) - https://phabricator.wikimedia.org/T164030#3655005 (10Dzahn) a:05demon>03Dzahn [20:08:57] (03PS1) 10Dzahn: releases: drop /ci/ suffix for jenkins-proxy, unify templates [puppet] - 10https://gerrit.wikimedia.org/r/382038 (https://phabricator.wikimedia.org/T164030) [20:10:39] Errm, I get an HTTP 500 error trying to go to a user page on mediawiki.org: "via cp3043 cp3043, Varnish XID 814068162 Error: 500, Internal Server Error at Tue, 03 Oct 2017 20:09:18 GMT" [20:10:55] Is that... known? Or even an actual issue? [20:11:12] Working on it [20:11:16] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10netops: vcp port down on fasw-c-codfw - https://phabricator.wikimedia.org/T177332#3655009 (10ayounsi) [20:11:18] Fix already in the stream [20:11:32] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10netops: vcp port down on fasw-c-codfw - https://phabricator.wikimedia.org/T177332#3655022 (10ayounsi) [20:11:33] (global user pages a little busted, so not all users affected) [20:11:35] thanks. [20:11:45] yw. 
Just showed up during deploy and legoktm already provided me with a patch <3 [20:14:38] (03CR) 10Chad: [C: 031] releases: drop /ci/ suffix for jenkins-proxy, unify templates [puppet] - 10https://gerrit.wikimedia.org/r/382038 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn) [20:17:40] (03CR) 10Krinkle: releases: drop /ci/ suffix for jenkins-proxy, unify templates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/382038 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn) [20:17:58] !log demon@tin Synchronized php-1.31.0-wmf.2/extensions/GlobalUserPage/includes/Hooks.php: fix broken thing (duration: 00m 51s) [20:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:11] (03CR) 10Zoranzoki21: [C: 031] releases: drop /ci/ suffix for jenkins-proxy, unify templates [puppet] - 10https://gerrit.wikimedia.org/r/382038 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn) [20:18:19] (03PS2) 10Zoranzoki21: releases: drop /ci/ suffix for jenkins-proxy, unify templates [puppet] - 10https://gerrit.wikimedia.org/r/382038 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn) [20:19:24] (03PS1) 10Andrew Bogott: labweb1001 and 2: fix typo for mgmt IPs [dns] - 10https://gerrit.wikimedia.org/r/382044 [20:20:01] (03CR) 10Andrew Bogott: [C: 032] labweb1001 and 2: fix typo for mgmt IPs [dns] - 10https://gerrit.wikimedia.org/r/382044 (owner: 10Andrew Bogott) [20:21:49] 10Operations, 10Contributors-Team, 10MobileFrontend, 10wikidiff2, and 2 others: Diff page consistently produces 503 on beta cluster on first visit - https://phabricator.wikimedia.org/T176637#3655031 (10MaxSem) Added you to the project, now you can ssh into hosts like `deployment-mediawiki04.eqiad.wmflabs`... 
[20:23:53] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10netops: Interface error on fasw-c-eqiad:vcp-255/1/0 - https://phabricator.wikimedia.org/T177333#3655036 (10ayounsi) [20:29:00] !log Start regenerating map tiles on codfw for z0-z12 - T176252 [20:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:07] T176252: Regenerate tiles - https://phabricator.wikimedia.org/T176252 [20:32:42] PROBLEM - DNS labweb1001.mgmt on labweb1001.mgmt is CRITICAL: DNS CRITICAL - expected 10.65.4.121 but got 10.64.4.121 [20:46:52] (03PS1) 10Andrew Bogott: labweb1001, 1002: initial site def [puppet] - 10https://gerrit.wikimedia.org/r/382057 [20:47:31] (03CR) 10Andrew Bogott: [C: 032] labweb1001, 1002: initial site def [puppet] - 10https://gerrit.wikimedia.org/r/382057 (owner: 10Andrew Bogott) [20:59:04] (03Draft1) 10Paladox: Gerrit: Set nullNamePatternMatchesAll to true [puppet] - 10https://gerrit.wikimedia.org/r/382065 [20:59:08] (03PS2) 10Paladox: Gerrit: Set nullNamePatternMatchesAll to true [puppet] - 10https://gerrit.wikimedia.org/r/382065 [21:17:19] (03PS3) 10EBernhardson: [cirrus] Force native script for super noop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376027 (https://phabricator.wikimedia.org/T174652) (owner: 10DCausse) [21:32:42] RECOVERY - DNS labweb1001.mgmt on labweb1001.mgmt is OK: DNS OK: 0.012 seconds response time. 
labweb1001.mgmt.eqiad.wmnet returns 10.65.4.121 [22:01:09] (03PS1) 10Andrew Bogott: Add a role for a trivial file download server [puppet] - 10https://gerrit.wikimedia.org/r/382069 (https://phabricator.wikimedia.org/T177293) [22:01:38] (03CR) 10jenkins-bot: [V: 04-1] Add a role for a trivial file download server [puppet] - 10https://gerrit.wikimedia.org/r/382069 (https://phabricator.wikimedia.org/T177293) (owner: 10Andrew Bogott) [22:03:09] (03PS2) 10Andrew Bogott: Add a role for a trivial file download server [puppet] - 10https://gerrit.wikimedia.org/r/382069 (https://phabricator.wikimedia.org/T177293) [22:03:42] (03CR) 10jenkins-bot: [V: 04-1] Add a role for a trivial file download server [puppet] - 10https://gerrit.wikimedia.org/r/382069 (https://phabricator.wikimedia.org/T177293) (owner: 10Andrew Bogott) [22:06:29] (03PS3) 10Andrew Bogott: add a role for a trivial file download server [puppet] - 10https://gerrit.wikimedia.org/r/382069 (https://phabricator.wikimedia.org/T177293) [22:07:06] (03CR) 10jenkins-bot: [V: 04-1] add a role for a trivial file download server [puppet] - 10https://gerrit.wikimedia.org/r/382069 (https://phabricator.wikimedia.org/T177293) (owner: 10Andrew Bogott) [22:14:23] (03PS4) 10Andrew Bogott: add a role for a trivial file download server [puppet] - 10https://gerrit.wikimedia.org/r/382069 (https://phabricator.wikimedia.org/T177293) [22:14:27] (03CR) 10jenkins-bot: [V: 04-1] add a role for a trivial file download server [puppet] - 10https://gerrit.wikimedia.org/r/382069 (https://phabricator.wikimedia.org/T177293) (owner: 10Andrew Bogott) [22:14:29] (03PS5) 10Andrew Bogott: Add a profile for a trivial file download server [puppet] - 10https://gerrit.wikimedia.org/r/382069 (https://phabricator.wikimedia.org/T177293) [22:14:43] (03CR) 10jenkins-bot: [V: 04-1] Add a profile for a trivial file download server [puppet] - 10https://gerrit.wikimedia.org/r/382069 (https://phabricator.wikimedia.org/T177293) (owner: 10Andrew Bogott) [22:20:38] !log
bsitzmann@tin Started deploy [mobileapps/deploy@82aa7d6]: Update mobileapps to 5dc0c02 (T175762 T177001 T176525 T176517 T176519) [22:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:49] T176525: Parenthetical: New edge case with nested brackets - https://phabricator.wikimedia.org/T176525 [22:20:49] T177001: Run fundraising test via announcment cards in France - https://phabricator.wikimedia.org/T177001 [22:20:49] T176517: Pages that are redirects give empty summary objects - https://phabricator.wikimedia.org/T176517 [22:20:49] T176519: Old templates can lead to sup elements inside summary - https://phabricator.wikimedia.org/T176519 [22:20:49] T175762: Add featured feed ITN entry for scowiki - https://phabricator.wikimedia.org/T175762 [22:23:44] (03PS1) 10Volans: Rakefile: return empty list if no files were linted [puppet] - 10https://gerrit.wikimedia.org/r/382076 [22:23:49] andrewbogott: ^^^ [22:24:23] (03CR) 10Volans: "linter.problems return nil if no files were linted." 
[puppet] - 10https://gerrit.wikimedia.org/r/382076 (owner: 10Volans) [22:24:31] (03CR) 10Andrew Bogott: [C: 031] Rakefile: return empty list if no files were linted [puppet] - 10https://gerrit.wikimedia.org/r/382076 (owner: 10Volans) [22:24:37] (03CR) 10Volans: [C: 032] Rakefile: return empty list if no files were linted [puppet] - 10https://gerrit.wikimedia.org/r/382076 (owner: 10Volans) [22:25:54] (03PS6) 10Volans: Add a profile for a trivial file download server [puppet] - 10https://gerrit.wikimedia.org/r/382069 (https://phabricator.wikimedia.org/T177293) (owner: 10Andrew Bogott) [22:26:53] !log bsitzmann@tin Finished deploy [mobileapps/deploy@82aa7d6]: Update mobileapps to 5dc0c02 (T175762 T177001 T176525 T176517 T176519) (duration: 06m 14s) [22:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:04] T176525: Parenthetical: New edge case with nested brackets - https://phabricator.wikimedia.org/T176525 [22:27:04] T177001: Run fundraising test via announcment cards in France - https://phabricator.wikimedia.org/T177001 [22:27:04] T176517: Pages that are redirects give empty summary objects - https://phabricator.wikimedia.org/T176517 [22:27:04] T176519: Old templates can lead to sup elements inside summary - https://phabricator.wikimedia.org/T176519 [22:27:04] T175762: Add featured feed ITN entry for scowiki - https://phabricator.wikimedia.org/T175762 [22:27:49] (03CR) 10Andrew Bogott: [C: 032] Add a profile for a trivial file download server [puppet] - 10https://gerrit.wikimedia.org/r/382069 (https://phabricator.wikimedia.org/T177293) (owner: 10Andrew Bogott) [22:32:12] (03PS1) 10Andrew Bogott: downloadserver: fix path to nginx config [puppet] - 10https://gerrit.wikimedia.org/r/382079 [22:35:10] (03CR) 10Andrew Bogott: [C: 032] downloadserver: fix path to nginx config [puppet] - 10https://gerrit.wikimedia.org/r/382079 (owner: 10Andrew Bogott) [22:35:51] (03PS1) 10Dmaza: Enable AbuseFilter runtime profile on Portuguese Wikipedia 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/382080 (https://phabricator.wikimedia.org/T177336) [22:39:02] (03PS1) 10Andrew Bogott: yet another fix after a hurried role to profile move [puppet] - 10https://gerrit.wikimedia.org/r/382081 [22:39:54] (03CR) 10Andrew Bogott: [C: 032] yet another fix after a hurried role to profile move [puppet] - 10https://gerrit.wikimedia.org/r/382081 (owner: 10Andrew Bogott) [22:43:24] 10Operations, 10Community-Tech, 10MediaWiki-General-or-Unknown, 10Stewards-and-global-tools (Temporary-UserRights): Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#3655455 (10kaldari) [22:50:22] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [22:58:09] (03PS3) 10Dzahn: releases: drop /ci/ suffix for jenkins-proxy, unify templates [puppet] - 10https://gerrit.wikimedia.org/r/382038 (https://phabricator.wikimedia.org/T164030) [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Evening SWAT (Max 8 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171003T2300). [23:00:04] ebernhardson and subbu: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:01:25] 10Operations, 10DBA, 10Availability (Multiple-active-datacenters), 10Performance-Team (Radar): Make client certs available for apache/maintenance hosts for TLS connections to mariadb - https://phabricator.wikimedia.org/T175672#3655525 (10aaron) >>! In T175672#3635865, @aaron wrote: >>>! In T175672#3635140,... [23:03:09] \o [23:04:31] i suppose i can ship the patches. 
Mine's a no-op just prepping for future train [23:04:54] (03CR) 10EBernhardson: [C: 032] [cirrus] Force native script for super noop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376027 (https://phabricator.wikimedia.org/T174652) (owner: 10DCausse) [23:05:09] * subbu is around [23:07:39] subbu: kk, will merge yours next [23:07:48] k [23:08:41] (03Merged) 10jenkins-bot: [cirrus] Force native script for super noop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376027 (https://phabricator.wikimedia.org/T174652) (owner: 10DCausse) [23:08:51] (03CR) 10jenkins-bot: [cirrus] Force native script for super noop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376027 (https://phabricator.wikimedia.org/T174652) (owner: 10DCausse) [23:10:47] (03PS15) 10EBernhardson: Enable RemexHTML on wikitech and eswikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379966 (https://phabricator.wikimedia.org/T175971) (owner: 10Zoranzoki21) [23:11:28] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-common.php: Pre-configure cirrussearch for 5.5.x upgrade (duration: 00m 51s) [23:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:34] (03CR) 10EBernhardson: [C: 032] Enable RemexHTML on wikitech and eswikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379966 (https://phabricator.wikimedia.org/T175971) (owner: 10Zoranzoki21) [23:14:20] (03Merged) 10jenkins-bot: Enable RemexHTML on wikitech and eswikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379966 (https://phabricator.wikimedia.org/T175971) (owner: 10Zoranzoki21) [23:16:10] (03CR) 10jenkins-bot: Enable RemexHTML on wikitech and eswikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379966 (https://phabricator.wikimedia.org/T175971) (owner: 10Zoranzoki21) [23:16:18] subbu: patch is live on mwdebug1001 if there's some testing to do [23:17:58] ebernhardson, asking James_F and TimStarling but I think not.
[23:18:23] it's already been tested, when it was deployed to testwiki etc. [23:18:31] ok, syncing out [23:20:06] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: T175971 T176088 Enable RemexHTML on wikitech and eswikiversity (duration: 00m 51s) [23:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:13] T175971: Enable RemexHTML on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T175971 [23:20:14] T176088: Enable RemexHTML on eswikiversity - https://phabricator.wikimedia.org/T176088 [23:21:18] subbu: you're synced out [23:21:24] alright! ty. [23:21:32] * subbu takes a look at wikitech pages [23:22:17] Yeah, looks good to me. [23:23:39] lgtm too [23:23:53] 4 down, 900 more to go :D [23:24:09] just follow the usual pattern 4.. 10.. 20.. 900 :P [23:30:00] (03PS4) 10Dzahn: releases: drop /ci/ suffix for jenkins-proxy, unify templates [puppet] - 10https://gerrit.wikimedia.org/r/382038 (https://phabricator.wikimedia.org/T164030) [23:46:22] !log Start regenerating map tiles on eqiad for z0-z12 - T176252 [23:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:28] T176252: Regenerate tiles - https://phabricator.wikimedia.org/T176252 [23:53:38] (03CR) 10Dzahn: [C: 032] releases: drop /ci/ suffix for jenkins-proxy, unify templates [puppet] - 10https://gerrit.wikimedia.org/r/382038 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn) [23:54:07] andrewbogott: unmerged change on puppetmaster [23:54:19] the nginx config [23:54:40] labs/downloadserver... will merge it [23:55:02] mutante: thanks, sorry [23:55:14] np, and ..done [23:55:22] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
[23:55:30] 10Operations, 10Community-Tech, 10MediaWiki-General-or-Unknown, 10Stewards-and-global-tools (Temporary-UserRights): Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#3655878 (10kaldari) If this is going to be done by a maintenance script and a cron job,... [23:57:06] and .. i should have compiled that [23:57:22] RECOVERY - MegaRAID on db1056 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [23:59:33] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues