[00:01:54] Could someone log into hafnium to see if there's anything wrong with statsv.py?
[00:02:05] It looks like it crashed in the weekend
[00:02:20] it's still down
[00:02:37] and bandwidth and CPU are up a lot. But it's not reporting anything to graphite.
[00:04:59] Krinkle: see above, in the middle of something important
[00:07:09] (CR) Avicennasis: "We are a small, small community, yes." [mediawiki-config] - https://gerrit.wikimedia.org/r/237049 (https://phabricator.wikimedia.org/T111753) (owner: MarcoAurelio)
[00:13:35] (PS1) Tim Landscheidt: WIP: Tools: Deploy local package management key [puppet] - https://gerrit.wikimedia.org/r/240021 (https://phabricator.wikimedia.org/T112699)
[00:16:15] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:21:17] operations, MediaWiki-Database: Compress data at external storage - https://phabricator.wikimedia.org/T106386#1661764 (Mattflaschen) Noted, thanks. What's the status now?
[00:32:43] !log Disabled Puppet for 24h on hafnium and stopped ganglia-monitor. gmond was saturating CPU.
[00:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:33:50] operations, Analytics-EventLogging, Graphite: Statsv down since 2015-09-20 07:53 - https://phabricator.wikimedia.org/T113315#1661778 (ori) There were two gmond processes running, each saturating a CPU core. As a temporary measure, I stopped `ganglia-monitor` and disabled Puppet to prevent it from getti...
[00:42:30] operations, Traffic, Patch-For-Review: Saving preferences or blocking (and probably various other things) give 403 error - https://phabricator.wikimedia.org/T113319#1661783 (faidon) Open>Resolved a:faidon Should be fixed -- sorry for the trouble.
[00:52:55] (03PS3) 10Yuvipanda: logstash: Enable logging via stashbot in irc channel wikimedia-analytics [puppet] - 10https://gerrit.wikimedia.org/r/240014 (https://phabricator.wikimedia.org/T111393) (owner: 10Madhuvishy) [00:53:17] (03CR) 10Yuvipanda: [C: 032 V: 032] "I did the hiera change too." [puppet] - 10https://gerrit.wikimedia.org/r/240014 (https://phabricator.wikimedia.org/T111393) (owner: 10Madhuvishy) [00:59:58] 6operations, 10Analytics-EventLogging, 7Graphite: Statsv down since 2015-09-20 07:53 - https://phabricator.wikimedia.org/T113315#1661805 (10ori) The statsv error log is filled with: ``` Traceback (most recent call last): File "/srv/deployment/statsv/statsv/statsv.py", line 51, in data = json.... [01:04:20] 6operations, 10Traffic, 10Wikimedia-General-or-Unknown, 7HTTPS: Set up "w.wiki" domain for usage with UrlShortener - https://phabricator.wikimedia.org/T108649#1661810 (10Legoktm) 5Open>3stalled I spoke with @bblack on IRC, when we renew the unified cert in mid-October, we will also add w.wiki to it. Ma... [01:08:09] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 (Hue / Hive) for bmansurov - https://phabricator.wikimedia.org/T113069#1661814 (10bmansurov) I'm able to ssh to stat1002. I, however, am not able to login at https://hue.wikimedia.org/accounts/login/?next=/ . @Ottomata, does my LDAP account need... 
[01:08:22] (03PS1) 10Dzahn: (WIP) mailman: script to rename list [puppet] - 10https://gerrit.wikimedia.org/r/240024 [01:09:06] (03CR) 10jenkins-bot: [V: 04-1] (WIP) mailman: script to rename list [puppet] - 10https://gerrit.wikimedia.org/r/240024 (owner: 10Dzahn) [01:14:26] PROBLEM - Disk space on mw1015 is CRITICAL: DISK CRITICAL - free space: / 8162 MB (3% inode=94%) [01:19:48] (03PS2) 10Dzahn: (WIP) mailman: script to rename list [puppet] - 10https://gerrit.wikimedia.org/r/240024 [01:20:50] (03CR) 10jenkins-bot: [V: 04-1] (WIP) mailman: script to rename list [puppet] - 10https://gerrit.wikimedia.org/r/240024 (owner: 10Dzahn) [01:25:22] (03PS37) 10Ori.livneh: Basic role for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [01:25:55] (03CR) 10jenkins-bot: [V: 04-1] Basic role for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [01:27:21] (03PS38) 10Ori.livneh: Basic role for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [01:28:05] (03CR) 10jenkins-bot: [V: 04-1] Basic role for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [01:28:40] (03CR) 10Ori.livneh: "PS36/37: de-parametrized a bunch of settings that did not need to be parameters, like the sentry user/group names, the config file locatio" [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [01:30:28] (03PS39) 10Ori.livneh: Basic role for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [01:35:31] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Dan Foy - https://phabricator.wikimedia.org/T113324#1661844 (10kevinator) [01:37:00] 10Ops-Access-Requests, 6operations: Requesting access 
to stat1002 for Zhou Zhou - https://phabricator.wikimedia.org/T113325#1661845 (10kevinator) 3NEW [01:38:08] 6operations, 10Analytics-EventLogging, 7Graphite: Statsv down since 2015-09-20 07:53 - https://phabricator.wikimedia.org/T113315#1661854 (10Krinkle) p:5Unbreak!>3High [01:40:25] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3048_v6 [01:40:26] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 78 data above and 9 below the confidence bounds [01:42:15] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [01:44:18] (03PS3) 10Dzahn: (WIP) mailman: script to rename list [puppet] - 10https://gerrit.wikimedia.org/r/240024 [01:44:26] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3045_v6 [01:46:16] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [01:46:26] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 23 connecting: (unnamed) not-conn: cp3018_v6 [01:48:17] PROBLEM - IPsec on cp1059 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp4020_v6 [01:50:04] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Chedasaurus - https://phabricator.wikimedia.org/T113302#1661871 (10Dzahn) @egalvezwmf Please take a look at L3 and sign it if you haven't already. Thank you! [01:50:06] RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK [01:50:06] RECOVERY - IPsec on cp1059 is OK: Strongswan OK - 24 ESP OK [01:53:10] !log sodium - deleted salt key, revoked puppet cert, rm from icinga .. 
[01:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:54:11] operations: check mailman queue size monitoring - https://phabricator.wikimedia.org/T113326#1661872 (Dzahn) NEW
[01:54:20] operations: check mailman queue size monitoring - https://phabricator.wikimedia.org/T113326#1661879 (Dzahn) a:Dzahn
[01:54:31] (CR) 20after4: "I intended for it to handle loggers passed in explicitly or by the decorator, I must have misunderstood how python keyword args work." [tools/scap] - https://gerrit.wikimedia.org/r/239028 (owner: 20after4)
[02:19:41] !log l10nupdate@tin Synchronized php-1.26wmf23/cache/l10n: l10nupdate for 1.26wmf23 (duration: 06m 00s)
[02:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:22:57] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf23) at 2015-09-22 02:22:56+00:00
[02:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:28:46] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2014_v6
[02:30:46] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK
[02:33:27] PROBLEM - Disk space on restbase2001 is CRITICAL: DISK CRITICAL - free space: /var 139212 MB (3% inode=99%)
[02:33:47] Ops-Access-Requests, operations: Requesting access to stat1002 for JUnikowski_WMF - https://phabricator.wikimedia.org/T113298#1661894 (Dzahn) Hi @JUnikowski_WMF, could you take a look at L3 and sign it?
Thank you, Daniel [02:34:07] 7Puppet: Nuyaml_backend does not allow binary Hiera data - https://phabricator.wikimedia.org/T113328#1661895 (10scfc) 3NEW [02:35:01] (03PS2) 10Tim Landscheidt: WIP: Tools: Deploy local package management key [puppet] - 10https://gerrit.wikimedia.org/r/240021 (https://phabricator.wikimedia.org/T112699) [02:41:47] 6operations, 10Wikimedia-Mailing-lists, 7Documentation: Overhaul Mailman documentation - https://phabricator.wikimedia.org/T109534#1661922 (10Dzahn) i made some changes on the wikitech page, replacing "sodium" with "fermium, some lighttpd stuff with Apache... as always with "fix docs on wiki"-tickets they... [02:41:56] PROBLEM - Disk space on restbase2002 is CRITICAL: DISK CRITICAL - free space: /var 135915 MB (3% inode=99%) [02:44:26] PROBLEM - Disk space on restbase2003 is CRITICAL: DISK CRITICAL - free space: /var 136137 MB (3% inode=99%) [02:52:51] 6operations, 10Analytics-EventLogging, 7Graphite: Statsv down since 2015-09-20 07:53 - https://phabricator.wikimedia.org/T113315#1661927 (10Ottomata) Hm, not sure why this would happen all of the sudden at this time. The only change that happened today was a restart of eventlogging to change a setting on th... [02:57:15] (03CR) 10Tim Landscheidt: [C: 04-1] WIP: Tools: Deploy local package management key [puppet] - 10https://gerrit.wikimedia.org/r/240021 (https://phabricator.wikimedia.org/T112699) (owner: 10Tim Landscheidt) [03:04:16] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied [03:06:33] 6operations, 10Annual-Report: create git/gerrit repo for annual report 2015 - https://phabricator.wikimedia.org/T112928#1661932 (10Dzahn) We should just use "annualreport" indeed. There exists ./2014, we will simply create ./2015 and done. And i send an email to Mule about how to clone from it etc. 
[03:20:35] RECOVERY - Disk space on labstore1002 is OK: DISK OK [03:23:27] 6operations, 10Annual-Report, 5Patch-For-Review: create git/gerrit repo for annual report 2015 - https://phabricator.wikimedia.org/T112928#1661947 (10Dzahn) Hi @Stephmonette added you here in the ticket to show notifications. I think Liam does not have the Phabricator user yet. [03:26:22] 6operations, 10Annual-Report, 5Patch-For-Review: create git/gerrit repo for annual report 2015 - https://phabricator.wikimedia.org/T112928#1661949 (10Dzahn) I added a group in Gerrit called MuleDesign and added Steph and Liam as members. The group name can be used when adding reviewers. Permissions can be b... [03:54:07] (03CR) 10Chad: "Why do we need to pass loggers anymore? I thought that's why we ended up here in the first place..." [tools/scap] - 10https://gerrit.wikimedia.org/r/239028 (owner: 1020after4) [03:55:54] (03CR) 1020after4: "It's no longer necessary, but I thought it'd be nice to retain the possibility to. If nobody sees a need for this then I suppose we could " [tools/scap] - 10https://gerrit.wikimedia.org/r/239028 (owner: 1020after4) [04:02:06] do ya'll, opsen, want the mediawiki/tools/scap repo still announced in here? [04:04:15] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-maps/snapshot is not accessible: Permission denied [04:05:54] greg-g: no [04:06:56] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 12.00% of data above the critical threshold [100000000.0] [04:13:35] no, where does the grrrit config live.... 
[04:13:37] now*
[04:14:12] yuv :(
[04:20:28] https://gerrit.wikimedia.org/r/#/c/240030/1
[04:24:47] <_joe_> greg-g: yes
[04:25:01] <_joe_> (I want to see it)
[04:26:26] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: puppet fail
[04:28:46] _joe_: too late, ori merged it :)
[04:35:46] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Sep 22 04:35:46 UTC 2015 (duration 35m 45s)
[04:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:43:16] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[04:53:46] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[05:10:32] Ops-Access-Requests, operations: Requesting access to stat1002 (Hue / Hive) for bmansurov - https://phabricator.wikimedia.org/T113069#1662002 (Dzahn) >>! In T113069#1661814, @bmansurov wrote: > @Ottomata, does my LDAP account need manual syncing? checked for group membership in "wmf" because that's a com...
[05:23:17] (CR) Dzahn: [C: -1] "personally i would like to keep bastiononly as a group for flexibility. i remember explicitly adding it because we had access requests for" [puppet] - https://gerrit.wikimedia.org/r/227327 (owner: Alex Monk)
[05:26:23] (PS1) Giuseppe Lavagetto: jobrunner/videoscaler: raise max_execution_time to 20 minutes [puppet] - https://gerrit.wikimedia.org/r/240033
[05:32:55] (CR) Giuseppe Lavagetto: [C: 2] jobrunner/videoscaler: raise max_execution_time to 20 minutes [puppet] - https://gerrit.wikimedia.org/r/240033 (owner: Giuseppe Lavagetto)
[05:36:35] (PS1) Dzahn: mira: remove inclusion of releases::upload [puppet] - https://gerrit.wikimedia.org/r/240034
[05:37:46] (CR) Dzahn: deployment::server: move releases::upload into role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/223464 (owner: Dzahn)
[05:38:37] <_joe_> mutante: did we fully migrate to fermium then?
[05:39:35] (CR) Dzahn: "thanks for the clarification @Kevinator" [puppet] - https://gerrit.wikimedia.org/r/197081 (https://phabricator.wikimedia.org/T83531) (owner: ArielGlenn)
[05:40:50] _joe_: yes, sodium is removed from puppet, though not shut down yet
[05:40:58] changes by faidon to remove lucid support are merged
[05:41:05] <_joe_> oh, nice!
[05:41:11] <_joe_> that was my next question
[05:41:15] :)
[05:41:27] i merged it, and also the one for git-core -> git package rename
[05:42:09] it caused a duplicate definition with tool labs puppet code but has been fixed
[05:44:05] there is clean up task to kill 3 lucid instances in labs or so, but they have been broken a long time
[05:47:11] Ops-Access-Requests, operations: Requesting access to stat1002 for JUnikowski_WMF - https://phabricator.wikimedia.org/T113298#1662017 (JUnikowski_WMF) Hi Dzahn, Sure. I actually did that prior to creating the task ("You signed this document on Mon, Sep 21, 9:28 PM.") Please let me know if something didn'...
[05:50:25] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for JUnikowski_WMF - https://phabricator.wikimedia.org/T113298#1662023 (10Dzahn) @JUnikowski_WMF no, you did it right and i did not see it :) confirmed your signature. all good. Now we'll just need your manager to add an approval. [05:50:27] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [06:09:26] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 20.69% of data above the critical threshold [100000000.0] [06:30:07] PROBLEM - puppet last run on mw2052 is CRITICAL: CRITICAL: puppet fail [06:30:17] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:45] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:05] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:06] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:07] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:27] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:36] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:48] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:56] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:05] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:26] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:45] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:26] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:46] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 
failures [06:33:57] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:16] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:56] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [06:40:36] RECOVERY - Disk space on mw1015 is OK: DISK OK [06:44:32] 6operations, 10MediaWiki-Database: Compress data at external storage - https://phabricator.wikimedia.org/T106386#1662071 (10jcrespo) Data migration done. I would recommend starting testing the compression on a slave on codfw to avoid production impact. [06:55:55] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [06:56:16] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:17] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:56:37] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:37] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:56:46] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:57:06] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:57:06] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:57:15] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:26] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:57:26] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet 
is currently enabled, last run 1 minute ago with 0 failures [06:57:35] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:37] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:57:37] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:55] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:56] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [06:59:26] RECOVERY - puppet last run on mw2052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:00:10] (03PS2) 10Muehlenhoff: Enable ferm for role::mariadb::analytics [puppet] - 10https://gerrit.wikimedia.org/r/235444 [07:08:10] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm for role::mariadb::analytics [puppet] - 10https://gerrit.wikimedia.org/r/235444 (owner: 10Muehlenhoff) [07:09:15] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[07:19:20] (03PS1) 10Muehlenhoff: Add ferm rules for role::mariadb::misc::eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/240042 (https://phabricator.wikimedia.org/T104699) [07:19:22] (03PS1) 10Muehlenhoff: Enable ferm on db1046 [puppet] - 10https://gerrit.wikimedia.org/r/240043 [07:33:52] (03PS2) 10Muehlenhoff: Enable ferm on snapshot1001 [puppet] - 10https://gerrit.wikimedia.org/r/237615 (https://phabricator.wikimedia.org/T104991) [07:36:07] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on snapshot1001 [puppet] - 10https://gerrit.wikimedia.org/r/237615 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) [07:40:36] (03CR) 10MarcoAurelio: [C: 04-1] "I mantain my opinion that there's not enough consensus for this change, nor a need, nor a sound rationale besides 'because I want it'." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239854 (https://phabricator.wikimedia.org/T111898) (owner: 10Mdann52) [07:50:00] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, not feeling strongly about the passing/no passing of logger discussion" [tools/scap] - 10https://gerrit.wikimedia.org/r/239028 (owner: 1020after4) [07:51:26] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 5 below the confidence bounds [07:56:47] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 5 below the confidence bounds [07:56:56] (03CR) 10Filippo Giunchedi: [C: 031] "nice work!" 
[puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [07:59:26] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [08:00:03] (03CR) 10Filippo Giunchedi: [C: 031] Move base::firewall include into the roles [puppet] - 10https://gerrit.wikimedia.org/r/239847 (owner: 10Muehlenhoff) [08:05:17] 6operations, 6Commons: delete gwtoolset job by Hansmuller - https://phabricator.wikimedia.org/T112878#1662289 (10Steinsplitter) >>! In T112878#1652997, @Hansmuller wrote: > I hope this incident will not affect my ability to upload speedily. No worry. Thanks for your volunteer work! It is appreciated. Please l... [08:06:13] (03CR) 10Filippo Giunchedi: [C: 04-1] Add an Analytics specific instance of RESTBase (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: 10Milimetric) [08:06:37] (03PS1) 10Muehlenhoff: Enable ferm on remaining snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/240048 (https://phabricator.wikimedia.org/T104991) [08:06:48] (03PS1) 10MarcoAurelio: [Security] Restrict course page editing for any wiki with EducationProgram Extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240049 (https://phabricator.wikimedia.org/T112806) [08:11:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 15 data above and 8 below the confidence bounds [08:13:34] (03CR) 10ArielGlenn: [C: 031] Enable ferm on remaining snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/240048 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) [08:14:53] (03CR) 10Mormegil: [C: 031] [Security] Restrict course page editing for any wiki with EducationProgram Extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240049 (https://phabricator.wikimedia.org/T112806) (owner: 10MarcoAurelio) [08:16:17] (03CR) 10MarcoAurelio: 
"Awhight tells that it'd be better to test this first on the beta cluster and if it works there, then merge." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240049 (https://phabricator.wikimedia.org/T112806) (owner: 10MarcoAurelio) [08:18:05] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:19:59] (03CR) 10Glaisher: [C: 04-1] "This should be in if ( $wmgUseEducationProgram ) block in CommonSettings.php as NS_EP is not defined on wikis where the extension is not e" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240049 (https://phabricator.wikimedia.org/T112806) (owner: 10MarcoAurelio) [08:20:46] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 8 below the confidence bounds [08:24:34] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on remaining snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/240048 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) [08:27:41] (03PS1) 10MarcoAurelio: Change default AbuseFilter IP block duration to not indefinite [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240053 (https://phabricator.wikimedia.org/T113164) [08:30:59] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 15 data above and 9 below the confidence bounds [08:35:48] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 15 data above and 9 below the confidence bounds [08:36:21] (03CR) 10Mobrovac: [C: 04-1] Add an Analytics specific instance of RESTBase (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: 10Milimetric) [08:40:15] 6operations, 5Patch-For-Review: Add Ferm rules for snapshot hosts - https://phabricator.wikimedia.org/T104991#1662334 (10MoritzMuehlenhoff) 5Open>3Resolved snapshot*, dataset1001, francium and ms1001 are now all using 
ferm. [08:40:45] 6operations, 5Continuous-Integration-Scaling, 7Database, 5Patch-For-Review: MySQL database for Nodepool - https://phabricator.wikimedia.org/T110693#1662337 (10hashar) >>! In T110693#1593380, @jcrespo wrote: > > BTW, the `FLUSH PRIVILEGES;` of the Openstack documentation is a bug: http://dbahire.com/stop-u... [08:45:12] (03CR) 10Alexandros Kosiaris: Add an Analytics specific instance of RESTBase (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: 10Milimetric) [08:48:43] (03PS2) 10Muehlenhoff: Move base::firewall include into the roles [puppet] - 10https://gerrit.wikimedia.org/r/239847 [08:50:17] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds [08:52:07] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [08:55:16] (03CR) 10Muehlenhoff: [C: 032 V: 032] Move base::firewall include into the roles [puppet] - 10https://gerrit.wikimedia.org/r/239847 (owner: 10Muehlenhoff) [09:12:20] (03PS1) 10Muehlenhoff: Enable ferm on db2010/db2030 [puppet] - 10https://gerrit.wikimedia.org/r/240055 [09:13:43] (03PS1) 10Muehlenhoff: Enable ferm on db1016 [puppet] - 10https://gerrit.wikimedia.org/r/240056 [09:13:45] (03PS1) 10Muehlenhoff: Enable ferm on db1020 [puppet] - 10https://gerrit.wikimedia.org/r/240057 [09:24:00] 6operations: Add ferm rules for eventlog hosts - https://phabricator.wikimedia.org/T113343#1662377 (10MoritzMuehlenhoff) [09:25:07] 6operations: Ferm rules for palladium - https://phabricator.wikimedia.org/T113344#1662384 (10MoritzMuehlenhoff) 3NEW [09:25:32] 6operations: Ferm rules for palladium - https://phabricator.wikimedia.org/T113344#1662392 (10MoritzMuehlenhoff) [09:28:10] (03PS1) 10Filippo Giunchedi: cassandra: add codfw production hosts [puppet] - 10https://gerrit.wikimedia.org/r/240060 (https://phabricator.wikimedia.org/T108613) [09:47:14] 
(03PS1) 10Muehlenhoff: Enable ferm on oxygen [puppet] - 10https://gerrit.wikimedia.org/r/240061 (https://phabricator.wikimedia.org/T83597) [09:51:12] (03PS1) 10Muehlenhoff: Enable ferm on mw1259 [puppet] - 10https://gerrit.wikimedia.org/r/240062 [09:51:14] (03PS1) 10Muehlenhoff: Enable ferm on mw1152 [puppet] - 10https://gerrit.wikimedia.org/r/240063 [09:56:14] (03PS2) 10Muehlenhoff: Enable ferm on mw1259 [puppet] - 10https://gerrit.wikimedia.org/r/240062 [09:57:19] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on mw1259 [puppet] - 10https://gerrit.wikimedia.org/r/240062 (owner: 10Muehlenhoff) [10:00:04] aude: Respected human, time to deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150922T1000). Please do the needful. [10:03:03] !log enabled ferm on mw1259 (videoscaler) [10:03:07] RECOVERY - Disk space on restbase2003 is OK: DISK OK [10:03:08] RECOVERY - Disk space on restbase2002 is OK: DISK OK [10:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:03:38] RECOVERY - Disk space on restbase2001 is OK: DISK OK [10:03:54] !log finished stressdisk on restbase200[123] no errors reported [10:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:06:58] (03PS1) 10Mdann52: Tidy robots.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240065 [10:07:29] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 5 below the confidence bounds [10:15:31] (03PS2) 10Muehlenhoff: Enable ferm on mw1152 [puppet] - 10https://gerrit.wikimedia.org/r/240063 [10:16:09] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on mw1152 [puppet] - 10https://gerrit.wikimedia.org/r/240063 (owner: 10Muehlenhoff) [10:16:38] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [10:17:14] !log enabled ferm on mw1152 
(videoscaler) [10:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:23:39] (03CR) 10Mobrovac: [C: 031] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/240060 (https://phabricator.wikimedia.org/T108613) (owner: 10Filippo Giunchedi) [10:26:52] (03PS1) 10Aude: Enable data access for Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240066 (https://phabricator.wikimedia.org/T107999) [10:30:41] * aude deploying [10:31:56] (03CR) 10Aude: [C: 032] Enable data access for Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240066 (https://phabricator.wikimedia.org/T107999) (owner: 10Aude) [10:32:03] (03Merged) 10jenkins-bot: Enable data access for Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240066 (https://phabricator.wikimedia.org/T107999) (owner: 10Aude) [10:35:52] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds [10:35:57] !log aude@tin Synchronized wmf-config/InitialiseSettings.php: Enable data access for Wikibooks (duration: 01m 12s) [10:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:36:31] are the snapshot hosts broken? 
[10:37:56] aude: we enabled firewall rules on them this morning, let me check
[10:38:59] ok
[10:39:13] i get stuff like connect to host snapshot1004.eqiad.wmnet port 22: Connection timed out
[10:39:20] aude: I think I know what the problem is
[10:39:22] even though i can login to them
[10:39:24] k
[10:40:25] I'll disable ferm for now (and will re-enable it once the updated rules are in place), so that you can proceed with your deployment, give me a minute
[10:40:32] ok
[10:41:13] * aude also needs to look into why our thing in swat last night was not deployed on wmf23
[10:41:19] it should work again, sorry for the disturbance
[10:41:23] ok
[10:41:32] will fixup the missing rule later on
[10:41:36] i'll sync the file again and finish the rest
[10:42:03] !log aude@tin Synchronized wmf-config/InitialiseSettings.php: Enable data access for Wikibooks - try again for snapshot hosts (duration: 00m 12s)
[10:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:42:09] it's good
[10:42:42] PROBLEM - puppet last run on mw1214 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:42:44] !log aude@tin Synchronized arbitraryaccess.dblist: Enable arbitrary access for Wikibooks (duration: 00m 12s)
[10:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:46:14] wonder why a submodule update didn't get produced for core
[10:57:28] (PS1) Muehlenhoff: snapshots: Allow SSH from deployment hosts [puppet] - https://gerrit.wikimedia.org/r/240070
[11:01:19] * aude waits for jenkins
[11:01:20] (PS2) Alex Monk: Tidy robots.txt [mediawiki-config] - https://gerrit.wikimedia.org/r/240065 (https://phabricator.wikimedia.org/T104251) (owner: Mdann52)
[11:05:24] zzzzzzzzzz
[11:05:44] what did you update that didn't trigger an automatic submodule update?
[11:06:05] my swat thing last night [11:06:22] it (wmf22 wikidata) should be a submodule of wmf23 core [11:07:19] it say branch = wmf/1.26wmf23 [11:07:21] which is wrong [11:07:24] * aude can fix [11:08:19] what was on tin was correct [11:10:12] RECOVERY - puppet last run on mw1214 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:13:58] * aude thinks .gitmodules is wrong for other special extensions like CentralNotice [11:15:03] and somewhat concerned it's not the right thing deployed :/ [11:15:38] * aude proceeds to deploy the wikibase things [11:18:18] !log aude@tin Synchronized php-1.26wmf23/extensions/Wikidata: Fix autocomment and change handling bugs (duration: 00m 21s) [11:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:18:54] (03CR) 10Gilles: "I think it's time to drop "Basic" from the commit message :)" [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [11:19:35] central notice looks correct [11:20:02] i think everything is probably correct but .gitmodules is wrong (and think twentyafterfour explained this to before) [11:20:22] anyway done [11:20:32] ... [11:21:20] twentyafterfour: that .gitmodules says tracking wmf/1.26wmf23 branch of every extension [11:21:25] even the special ones [11:21:46] and somehow (is that related) that the automatic submodule updates during swat didn't get produced (for wikidata) [11:22:09] (03CR) 10GWicke: "LGTM to me as well." [puppet] - 10https://gerrit.wikimedia.org/r/240060 (https://phabricator.wikimedia.org/T108613) (owner: 10Filippo Giunchedi) [11:22:16] hmm.. 
I don't think that's intentional [11:22:22] oh [11:22:49] i think the right versions are on tin [11:22:56] (03CR) 10GWicke: "s/to me//" [puppet] - 10https://gerrit.wikimedia.org/r/240060 (https://phabricator.wikimedia.org/T108613) (owner: 10Filippo Giunchedi) [11:23:59] one thing is that tin has an old version of git and it doesn't properly respect the branch= setting [11:24:06] * aude nods [11:24:26] tin having old versions of everything really sucks [11:24:35] :( [11:24:44] tin should have hhvm etc [11:24:46] it means that scap is stuck in python 2.7 [11:25:08] command line php binary is 5.3 [11:25:12] :( [11:25:35] lots of old versions everywhere [11:26:18] (03CR) 10Mobrovac: "Sounds sensible to me. The implicit limit of 9 instances per node won't present a real problem." [puppet] - 10https://gerrit.wikimedia.org/r/240060 (https://phabricator.wikimedia.org/T108613) (owner: 10Filippo Giunchedi) [11:26:27] but why was there no automatic git submodule update for wikidata? [11:26:38] because .gitmodules was wrong? 
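Note on the `.gitmodules` question above: a minimal sketch of how one might check which submodules track an unexpected branch. It assumes the usual `[submodule "name"]` layout with a `branch =` key; the sample URLs and the `wmf/1.26wmf22` override are hypothetical illustrations, not the real deployment values, and this is not the branch-cutting script mentioned in the channel.

```python
import re


def parse_gitmodules(text):
    """Return {submodule name: {key: value}} from .gitmodules-style text."""
    modules = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        header = re.match(r'\[submodule "(.+)"\]$', line)
        if header:
            current = modules.setdefault(header.group(1), {})
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    return modules


def mismatched_branches(text, default_branch, overrides):
    """List (name, actual, expected) for submodules tracking the wrong branch."""
    problems = []
    for name, settings in parse_gitmodules(text).items():
        expected = overrides.get(name, default_branch)
        actual = settings.get("branch")
        if actual != expected:
            problems.append((name, actual, expected))
    return problems


# Hypothetical sample: every extension tracks the core branch, including
# one "special" extension that is assumed to need its own branch.
SAMPLE = '''
[submodule "extensions/Wikidata"]
	path = extensions/Wikidata
	url = https://example.org/Wikidata.git
	branch = wmf/1.26wmf23
[submodule "extensions/Cite"]
	path = extensions/Cite
	url = https://example.org/Cite.git
	branch = wmf/1.26wmf23
'''

print(mismatched_branches(
    SAMPLE,
    default_branch="wmf/1.26wmf23",
    overrides={"extensions/Wikidata": "wmf/1.26wmf22"},
))
```

With the override in place, only the special extension is reported; with an empty override map, nothing is flagged, which matches the "everything says wmf/1.26wmf23" symptom described in the channel.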
[11:27:08] aude: probably [11:27:09] * aude slightly annoyed at staying up until 1-2am for swat only for this not to be correct [11:27:21] and too tired to notice so [11:27:54] aude: I will try to make sure that it is right with the next branch [11:28:00] ok, thanks :) [11:28:01] if there is a bug in the script I'll fix it [11:28:22] now i know at least for wmf23, in case we need to put something else in swat [11:28:32] note that we are getting ready to switch to semver, next week I believe [11:28:37] \o/ [11:29:02] * aude off to get some food [11:29:06] :) [11:29:19] thanks for helping look at this [11:29:28] you're welcome [11:39:52] (03PS4) 10Phuedx: Replicate browser test config for QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239158 (https://phabricator.wikimedia.org/T112204) (owner: 10Jdlrobson) [11:45:42] 6operations: Ferm rules for tin/mira - https://phabricator.wikimedia.org/T113351#1662592 (10MoritzMuehlenhoff) 3NEW a:3MoritzMuehlenhoff [11:46:04] 6operations: Ferm rules for tin/mira - https://phabricator.wikimedia.org/T113351#1662601 (10MoritzMuehlenhoff) [11:49:10] moritzm, releases::upload is being moved around a bit btw - https://gerrit.wikimedia.org/r/#/c/240034/ [11:50:00] (03PS1) 10Muehlenhoff: Add ferm rules for rsyncd/scap master [puppet] - 10https://gerrit.wikimedia.org/r/240074 (https://phabricator.wikimedia.org/T113351) [11:50:55] Krenair: thanks for the pointer [12:16:33] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 16 data above and 7 below the confidence bounds [12:19:15] (03PS1) 10Giuseppe Lavagetto: videoscaler: raise connection_timeout_seconds as well [puppet] - 10https://gerrit.wikimedia.org/r/240075 [12:22:21] Krenair: your expanddblist change has broken all cronjobs [12:22:29] all foreachwiki cronjobs, to be exact [12:22:37] Subject: Cron /usr/local/bin/foreachwiki extensions/SecurePoll/cli/purgePrivateVoteData.php 2>&1 > /dev/null [12:22:40] 
/usr/local/bin/foreachwikiindblist: line 4: expanddblist: command not found [12:24:17] what... [12:24:27] that command works for me [12:25:14] does it need /usr/local/bin/? [12:25:17] probably [12:25:35] probably $PATH as run from cron doesn't have it [12:25:49] ugh [12:27:06] yeah the default path is just /usr/bin:/bin [12:27:58] so either fully qualify it, or change PATH in /etc/profile.d/mediawiki.sh or something [12:30:37] (03PS1) 10Alex Monk: Fully qualify expanddblist path in foreachwikiindblist [puppet] - 10https://gerrit.wikimedia.org/r/240078 [12:41:37] (03CR) 10GWicke: "> {node}-{instance}" [puppet] - 10https://gerrit.wikimedia.org/r/240060 (https://phabricator.wikimedia.org/T108613) (owner: 10Filippo Giunchedi) [12:49:27] (03PS1) 10Sbisson: Re-enable Flow on Flow_test_talk on beta (en and ca) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240079 [12:55:14] (03CR) 10Faidon Liambotis: [C: 032] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/240078 (owner: 10Alex Monk) [12:55:16] (03CR) 10Filippo Giunchedi: "yup, what Marko mentioned, see also an example at https://gerrit.wikimedia.org/r/#/c/234292/1/hieradata/regex.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/240060 (https://phabricator.wikimedia.org/T108613) (owner: 10Filippo Giunchedi) [12:56:32] PROBLEM - puppet last run on mc2004 is CRITICAL: CRITICAL: puppet fail [12:56:45] (03PS1) 10Muehlenhoff: Enable ferm on tin [puppet] - 10https://gerrit.wikimedia.org/r/240083 [12:57:20] (03CR) 10GWicke: [C: 031] cassandra: add codfw production hosts [puppet] - 10https://gerrit.wikimedia.org/r/240060 (https://phabricator.wikimedia.org/T108613) (owner: 10Filippo Giunchedi) [13:04:32] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 1 below the confidence bounds [13:10:27] (03PS2) 10Muehlenhoff: snapshots: Allow SSH from deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/240070 [13:16:36] (03CR) 10Muehlenhoff: [C: 032 
V: 032] snapshots: Allow SSH from deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/240070 (owner: 10Muehlenhoff) [13:24:12] RECOVERY - puppet last run on mc2004 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [13:28:30] (03PS1) 10Cmjohnson: DNS updates for elastic1005 and elastic1030 [dns] - 10https://gerrit.wikimedia.org/r/240086 [13:29:22] (03CR) 10Ottomata: [C: 031] Enable ferm on oxygen [puppet] - 10https://gerrit.wikimedia.org/r/240061 (https://phabricator.wikimedia.org/T83597) (owner: 10Muehlenhoff) [13:31:34] chasemp: ^^ when we're ready...we may want to make the etc/network/interfaces change prior to shutting down [13:32:31] Understood. Thanks chris [13:32:32] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 (Hue / Hive) for bmansurov - https://phabricator.wikimedia.org/T113069#1662697 (10Ottomata) Hue does need manual syncing. Done. [13:35:18] (03PS1) 10Faidon Liambotis: Switch mail smarthosts to mx1001/mx2001 [puppet] - 10https://gerrit.wikimedia.org/r/240087 [13:36:00] (03CR) 10jenkins-bot: [V: 04-1] Switch mail smarthosts to mx1001/mx2001 [puppet] - 10https://gerrit.wikimedia.org/r/240087 (owner: 10Faidon Liambotis) [13:36:43] (03PS1) 10Filippo Giunchedi: restbase: add LVS codfw configuration [puppet] - 10https://gerrit.wikimedia.org/r/240088 (https://phabricator.wikimedia.org/T108613) [13:39:12] (03PS2) 10Giuseppe Lavagetto: videoscaler: raise connection_timeout_seconds as well [puppet] - 10https://gerrit.wikimedia.org/r/240075 [13:39:27] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] videoscaler: raise connection_timeout_seconds as well [puppet] - 10https://gerrit.wikimedia.org/r/240075 (owner: 10Giuseppe Lavagetto) [13:41:31] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [13:52:26] (03CR) 10BBlack: [C: 04-1] "This also needs the service IP added to the codfw lists (matching similar in 
eqiad) in $lvs_balancer_ips in modules/role/manifests/lvs/bal" [puppet] - 10https://gerrit.wikimedia.org/r/240088 (https://phabricator.wikimedia.org/T108613) (owner: 10Filippo Giunchedi) [13:54:21] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [13:59:28] 6operations, 6Performance-Team, 10Traffic, 7Performance: enwiki Main_Page timeouts - https://phabricator.wikimedia.org/T104225#1662783 (10BBlack) For PyBal, we switched monitoring to `Special:BlankPage`, so this doesn't depool servers anymore. So now it has even less visibility from ops' perspective, but... [14:02:18] 6operations, 10Traffic: Fix Varnish TTLs across the board - https://phabricator.wikimedia.org/T108612#1662790 (10BBlack) 5Open>3Resolved a:3BBlack [14:02:29] 6operations, 10Traffic: Fix Varnish TTLs across the board - https://phabricator.wikimedia.org/T108612#1524905 (10BBlack) p:5High>3Unbreak! [14:02:55] 6operations, 10Traffic, 10fundraising-tech-ops, 7IPv6, 5Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#1662795 (10BBlack) 5Open>3Resolved [14:03:25] 6operations, 10Traffic, 10fundraising-tech-ops, 7IPv6, 5Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#766158 (10BBlack) p:5High>3Triage [14:03:57] (03PS2) 10Muehlenhoff: Enable ferm on oxygen [puppet] - 10https://gerrit.wikimedia.org/r/240061 (https://phabricator.wikimedia.org/T83597) [14:04:19] much hate for phab resetting priorities on drag-n-drop between columns just because they're priority-sorted :P [14:04:40] it is truly an abomination [14:05:53] especially as one is not even able to see the whole workboard [14:11:29] 6operations, 10ops-eqiad: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. 
- https://phabricator.wikimedia.org/T112559#1662803 (10chasemp) https://gerrit.wikimedia.org/r/#/c/240086/ [14:12:02] (03PS2) 10Faidon Liambotis: Switch mail smarthosts to mx1001/mx2001 [puppet] - 10https://gerrit.wikimedia.org/r/240087 [14:12:40] (03CR) 10Faidon Liambotis: [C: 032] Switch mail smarthosts to mx1001/mx2001 [puppet] - 10https://gerrit.wikimedia.org/r/240087 (owner: 10Faidon Liambotis) [14:12:58] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on oxygen [puppet] - 10https://gerrit.wikimedia.org/r/240061 (https://phabricator.wikimedia.org/T83597) (owner: 10Muehlenhoff) [14:13:33] (03PS3) 10Faidon Liambotis: Switch mail smarthosts to mx1001/mx2001 [puppet] - 10https://gerrit.wikimedia.org/r/240087 [14:13:38] (03CR) 10Faidon Liambotis: [V: 032] Switch mail smarthosts to mx1001/mx2001 [puppet] - 10https://gerrit.wikimedia.org/r/240087 (owner: 10Faidon Liambotis) [14:14:17] (03CR) 10Giuseppe Lavagetto: "I created this change when I added a command-line option to run in foreground, before a discussion with mark where we decided to remove it" [debs/pybal] - 10https://gerrit.wikimedia.org/r/239390 (owner: 10Giuseppe Lavagetto) [14:14:19] !log depool elastic nodes for T112559 [14:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:19:04] PROBLEM - puppet last run on mw2096 is CRITICAL: CRITICAL: puppet fail [14:19:44] !log starting slow restart of varnish + varnish-frontend daemon processes on global text, upload, and mobile clusters for shm_reclen (all randomly blended, no parallelism, ~5 minute spacing, will take ~9 hours - FEs will lose cache data, BEs will not) [14:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:36:44] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures [14:38:57] ESC[mNotice: /Stage[main]/Tendril/Git::Clone[operations/software/tendril]/Exec[git_pull_operations/software/tendril]/returns: error: The requested URL returned 
error: 503 while accessing https://gerrit.wikimedia.org/r/p/operations/software/tendril.git/info/refs [14:39:03] ^ on neon [14:42:09] !log shutting down elastic1005 and elastic1030 to move around within the data center [14:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:43:34] PROBLEM - Host elastic1005 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:27] oops forgot to disable icinga checks ...done for elastic1030 [14:44:52] yes sorry I'm on it [14:46:07] krrrit-wm: ?? [14:46:29] <_joe_> paravoid: k for kubernetes, I guess [14:46:36] I know that [14:46:45] I just pushed changesets and they didn't appear [14:46:50] <_joe_> uhm [14:48:43] RECOVERY - puppet last run on mw2096 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:49:06] <_joe_> paravoid: as far as kubernetes is concerned, it is running fine, FWIW [14:49:23] <_joe_> I have no idea how it works besides that :P [14:52:07] operations, Patch-For-Review: Tweak sysctl settings for nf_conntrack - https://phabricator.wikimedia.org/T105307#1662862 (MoritzMuehlenhoff) Resolved>Open Another change worth considering is to lower the connection tracking timeout for connections in TIME_WAIT status. The initial job runner which ha... [14:52:31] operations, Patch-For-Review: Tweak sysctl settings for nf_conntrack - https://phabricator.wikimedia.org/T105307#1662864 (MoritzMuehlenhoff) [14:52:44] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 8.33% of data above the critical threshold [500.0] [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150922T1500).
[15:02:33] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:11:23] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:16:23] RECOVERY - puppet last run on analytics1026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:18:26] <_joe_> jzerebecki: I have a comment on https://gerrit.wikimedia.org/r/#/c/239367/ [15:18:35] <_joe_> since krrrit-wm seems not to behave [15:18:40] <_joe_> I could restart it [15:24:12] Krenair: for you: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=184176&oldid=184175 [15:27:40] greg-g: I've added some dynamic markings to the Deployments page yesterday. Somewhat experimental, let me know if you hear or experience any issues. [15:28:41] Krinkle: where? I don't see you editing it yesterday in the history [15:28:50] It's not on the page. [15:28:52] It's magic [15:28:59] It acts on the page via JavaScript [15:29:44] Past events are dimmed, and any events currently active (or within 5 min) are marked yellow [15:29:52] magic? [15:30:00] it stays up to date on open tabs as well, so no refresh needed [15:30:07] I don't see it [15:30:31] Hmm... [15:30:40] Did you refresh the page at least once since 18 hours ago? [15:30:56] Should look like this, http://i.imgur.com/5fUwtP3.png [15:31:20] I just did [15:31:26] hey #ops - you'd be proud of me. I made elastic support jessie as a first class citizen. [15:31:37] that's cool looking [15:31:38] <_joe_> manybubbles: \o/ we are! [15:31:46] manybubbles: it's a nik!
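Note on the Deployments page markings described above: a rough Python sketch of the classification rule as stated in the channel (past windows dimmed, windows that are active or start within 5 minutes highlighted). The real feature runs as JavaScript on the wiki page; the function name, the return labels, and the one-hour window length in the example are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# "within 5 min" is read here as "starts within the next 5 minutes".
SOON = timedelta(minutes=5)


def classify(start, end, now):
    """Return 'past', 'active', or 'upcoming' for one calendar window."""
    if end <= now:
        return "past"      # would be dimmed
    if start - SOON <= now:
        return "active"    # would be marked yellow
    return "upcoming"      # left unmarked


# Example window: 15:00 to 16:00 on 2015-09-22 (the length is assumed).
start = datetime(2015, 9, 22, 15, 0)
end = datetime(2015, 9, 22, 16, 0)

for now in (datetime(2015, 9, 22, 14, 50),
            datetime(2015, 9, 22, 14, 56),
            datetime(2015, 9, 22, 15, 30),
            datetime(2015, 9, 22, 16, 0)):
    print(now.time(), classify(start, end, now))
```

Re-running the classification on a timer is one way an open tab could stay current without a refresh, as mentioned in the discussion.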
[15:31:50] RECOVERY - Host elastic1005 is UP: PING OK - Packet loss = 0%, RTA = 1.16 ms [15:31:54] greg-g: I never really left [15:31:57] (03CR) 10Ottomata: [C: 032] Set umask to 0002 for wikidev users on stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/240114 (https://phabricator.wikimedia.org/T111956) (owner: 10Ottomata) [15:32:04] manybubbles: :) still miss ya though, man [15:32:10] <_joe_> ^^ [15:32:11] <_joe_> that! [15:32:30] _joe_: hurray! elasticsearch now has tests for 10 flavors of linux. suse is annoying [15:32:39] I miss you all too [15:32:51] don't tell chris that suse is annoying [15:32:58] <_joe_> :P [15:33:07] <_joe_> anything ! Debian is annoying, really [15:33:17] computers sure are [15:33:18] <_joe_> how come people still use anything else [15:33:32] <_joe_> greg-g: it's not computers, it's the software [15:33:40] <_joe_> and who writes the software? DEVELOPERS [15:33:43] * greg-g shrugs [15:33:46] lol - I find centos-7 to be pretty ok actually. [15:33:49] <_joe_> => developers are annoying [15:33:51] <_joe_> :) [15:33:53] and yes, all computers are annoying [15:34:40] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0] [15:35:01] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/240060 (https://phabricator.wikimedia.org/T108613) (owner: 10Filippo Giunchedi) [15:36:33] 6operations, 10Traffic: Upgrade Pybal to 1.08 - https://phabricator.wikimedia.org/T110954#1662966 (10Joe) 5Open>3Resolved [15:36:35] manybubbles: \o/ good job -- and yes we miss you [15:36:53] <_joe_> chasemp: ^^ am I correct considering this done? [15:36:53] (03PS1) 10Alexandros Kosiaris: Backup home_pmtpa on bast1001 [puppet] - 10https://gerrit.wikimedia.org/r/240115 (https://phabricator.wikimedia.org/T113265) [15:37:08] <_joe_> or the ticket was about upgrading the actual LVS hosts to a new version? 
[15:37:22] I don't see wikibugs so I'm not sure what task [15:37:23] 6operations, 10Traffic: Upgrade Pybal to 1.08 - https://phabricator.wikimedia.org/T110954#1590827 (10Joe) 5Resolved>3Open [15:37:36] (03CR) 10BBlack: [C: 031] "LGTM, note comment re SIGINT (not critical, could do it later in a followup patch)" (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/239391 (owner: 10Giuseppe Lavagetto) [15:37:48] <_joe_> uhm I had a fluke I didn't see the description well. [15:37:50] (03PS2) 10JanZerebecki: Make link in dataset relative [puppet] - 10https://gerrit.wikimedia.org/r/239367 (https://phabricator.wikimedia.org/T112892) [15:37:59] (03CR) 10BBlack: [C: 031] Remove LogFile [debs/pybal] - 10https://gerrit.wikimedia.org/r/239392 (owner: 10Giuseppe Lavagetto) [15:38:29] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM overall" [puppet] - 10https://gerrit.wikimedia.org/r/239367 (https://phabricator.wikimedia.org/T112892) (owner: 10JanZerebecki) [15:38:44] (03CR) 10JanZerebecki: Make link in dataset relative (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/239367 (https://phabricator.wikimedia.org/T112892) (owner: 10JanZerebecki) [15:38:56] 6operations, 10Analytics-EventLogging, 7Graphite: Statsv down since 2015-09-20 07:53 - https://phabricator.wikimedia.org/T113315#1662979 (10Krinkle) >>! In T113315#1661927, @Ottomata wrote: > Also, this is statsv, right? statsv != eventlogging, so I'm not sure what Kafka code you are referring to. See http... 
[15:39:58] (03CR) 10BBlack: [C: 031] Add systemd support, remove sysvinit support [debs/pybal] - 10https://gerrit.wikimedia.org/r/239393 (owner: 10Giuseppe Lavagetto) [15:40:29] <_joe_> bblack: adding SIGINT now [15:40:46] (03PS2) 10Filippo Giunchedi: cassandra: add codfw production hosts [puppet] - 10https://gerrit.wikimedia.org/r/240060 (https://phabricator.wikimedia.org/T108613) [15:40:56] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add codfw production hosts [puppet] - 10https://gerrit.wikimedia.org/r/240060 (https://phabricator.wikimedia.org/T108613) (owner: 10Filippo Giunchedi) [15:41:24] 6operations, 10Analytics-EventLogging, 7Graphite: Statsv down since 2015-09-20 07:53 - https://phabricator.wikimedia.org/T113315#1662994 (10Ottomata) Aye, I don't know what encoding varnishkafka is sending, but it is C and I wouldn't be surprised if it was just ascii. But as far as I know, nothing has chang... [15:41:32] !log stop puppet on restbase2* pending codfw expansion [15:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:41:59] (03CR) 10BBlack: [C: 031] restbase: add LVS codfw configuration [puppet] - 10https://gerrit.wikimedia.org/r/240088 (https://phabricator.wikimedia.org/T108613) (owner: 10Filippo Giunchedi) [15:42:18] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 5Patch-For-Review: Fix llama user id {hawk} [5 pts] - https://phabricator.wikimedia.org/T100678#1663000 (10Dzahn) moved to "Done" means it's resolved, right? 
[15:43:09] (03PS3) 10Giuseppe Lavagetto: Remove daemonization options [debs/pybal] - 10https://gerrit.wikimedia.org/r/239391 [15:43:28] (03CR) 10Giuseppe Lavagetto: Remove daemonization options (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/239391 (owner: 10Giuseppe Lavagetto) [15:44:31] (03CR) 10Dzahn: Manage llama user in puppet to work around package bug (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/240111 (https://phabricator.wikimedia.org/T100678) (owner: 10Ottomata) [15:45:12] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 5Patch-For-Review: Fix llama user id {hawk} [5 pts] - https://phabricator.wikimedia.org/T100678#1663010 (10Dzahn) shouldn't that user have "system=> true" though if the point is to create it as a system user? [15:46:20] !log running puppet on restbase2001 [15:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:47:57] (03PS2) 10Dzahn: Backup home_pmtpa on bast1001 [puppet] - 10https://gerrit.wikimedia.org/r/240115 (https://phabricator.wikimedia.org/T113265) (owner: 10Alexandros Kosiaris) [15:48:08] (03CR) 10Dzahn: [C: 032] Backup home_pmtpa on bast1001 [puppet] - 10https://gerrit.wikimedia.org/r/240115 (https://phabricator.wikimedia.org/T113265) (owner: 10Alexandros Kosiaris) [15:48:16] (03CR) 10Ottomata: Manage llama user in puppet to work around package bug (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/240111 (https://phabricator.wikimedia.org/T100678) (owner: 10Ottomata) [15:51:48] (03CR) 10Alexandros Kosiaris: [C: 032] Mute OTRS cronspam [puppet] - 10https://gerrit.wikimedia.org/r/240113 (owner: 10Alexandros Kosiaris) [15:52:11] (03PS2) 10Alexandros Kosiaris: Mute OTRS cronspam [puppet] - 10https://gerrit.wikimedia.org/r/240113 [15:53:10] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:53:23] (03CR) 10Giuseppe Lavagetto: [C: 032] Remove daemonization options [debs/pybal] - 
10https://gerrit.wikimedia.org/r/239391 (owner: 10Giuseppe Lavagetto) [15:54:53] (03PS3) 10Giuseppe Lavagetto: Remove LogFile [debs/pybal] - 10https://gerrit.wikimedia.org/r/239392 [15:55:02] (03CR) 10Giuseppe Lavagetto: [C: 032] Remove LogFile [debs/pybal] - 10https://gerrit.wikimedia.org/r/239392 (owner: 10Giuseppe Lavagetto) [15:55:17] (03Merged) 10jenkins-bot: Remove LogFile [debs/pybal] - 10https://gerrit.wikimedia.org/r/239392 (owner: 10Giuseppe Lavagetto) [15:56:05] (03CR) 10Dduvall: "The vast majority of these corrections appear to be the result of a bug in RuboCop where a comma is inserted after a final hash argument, " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/238779 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [15:56:50] (03CR) 10Giuseppe Lavagetto: [C: 032] Add systemd support, remove sysvinit support [debs/pybal] - 10https://gerrit.wikimedia.org/r/239393 (owner: 10Giuseppe Lavagetto) [15:56:58] (03PS4) 10Giuseppe Lavagetto: Add systemd support, remove sysvinit support [debs/pybal] - 10https://gerrit.wikimedia.org/r/239393 [15:57:26] 6operations, 5Patch-For-Review: Tweak sysctl settings for nf_conntrack - https://phabricator.wikimedia.org/T105307#1663061 (10BBlack) It probably has something to do with being able to match up very-delayed final packets that come in for a connection that's well past its true TIME_WAIT, as opposed to consideri... [15:59:08] (03CR) 10Dduvall: "I should add, however, that applying the Puppet coding conventions to the Ruby implementation does seem incongruent with how we apply stan" [puppet] - 10https://gerrit.wikimedia.org/r/238779 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [16:00:04] RobH bblack: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150922T1600). [16:00:04] irc-nickname Krenair jzerebecki: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. 
Please be available during the process. [16:00:18] \o [16:00:25] (03PS1) 10Ottomata: Set system => true for llama user [puppet] - 10https://gerrit.wikimedia.org/r/240117 (https://phabricator.wikimedia.org/T100678) [16:01:33] hi [16:01:45] ok swat time [16:01:49] bblack: you about? [16:02:39] (if he isn't I'll simply deploy without him) [16:03:11] Krenair: thx for splitting up the restbase part from your patchset [16:04:07] !log disabling puppet across mw hosts for new configuration deployment [16:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:04:50] (03PS2) 10Ori.livneh: Use argparse instead of getopt [debs/pybal] - 10https://gerrit.wikimedia.org/r/239390 (owner: 10Giuseppe Lavagetto) [16:04:57] (03CR) 10Ori.livneh: [C: 032] Use argparse instead of getopt [debs/pybal] - 10https://gerrit.wikimedia.org/r/239390 (owner: 10Giuseppe Lavagetto) [16:05:07] Krenair: your patch is up first, im doing it now =] [16:05:13] (03Merged) 10jenkins-bot: Use argparse instead of getopt [debs/pybal] - 10https://gerrit.wikimedia.org/r/239390 (owner: 10Giuseppe Lavagetto) [16:05:15] PROBLEM - Cassanda CQL query interface on restbase2001 is CRITICAL: Connection refused [16:05:26] (03PS4) 10RobH: Redirect et.wikimedia.org to ee.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/239278 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [16:05:37] PROBLEM - Cassandra database on restbase2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [16:05:40] 6operations, 5Patch-For-Review: Tweak sysctl settings for nf_conntrack - https://phabricator.wikimedia.org/T105307#1663076 (10MoritzMuehlenhoff) > Maybe leave it a little longer than kernel TIME_WAIT though, as it's probably not ideal to have conntrack forget it before the TCP stack itself does. Maybe someth... 
[16:05:56] (03CR) 10Ottomata: [C: 032] Set system => true for llama user [puppet] - 10https://gerrit.wikimedia.org/r/240117 (https://phabricator.wikimedia.org/T100678) (owner: 10Ottomata) [16:05:57] robh: yes [16:06:17] PROBLEM - puppet last run on analytics1026 is CRITICAL: CRITICAL: puppet fail [16:06:20] bblack: was just going to ask if you wanted to deploy or if i should, but ive started to deploy [16:06:22] (I kinda forgot the timewindow heh) [16:06:23] can you +1 on the patchset though? [16:06:24] ok [16:06:30] https://gerrit.wikimedia.org/r/#/c/239278/4 [16:06:47] PROBLEM - Restbase endpoints health on restbase2001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [16:06:52] give me a sec to read [16:07:05] PROBLEM - Restbase root url on restbase2001 is CRITICAL: Connection refused [16:07:06] RECOVERY - Cassanda CQL query interface on restbase2001 is OK: TCP OK - 0.034 second response time on port 9042 [16:07:08] no worries, this is a simple apache redirect change (its even an existing redirection) [16:07:27] RECOVERY - Cassandra database on restbase2001 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [16:07:43] (03CR) 10BBlack: [C: 031] Redirect et.wikimedia.org to ee.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/239278 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [16:07:49] paravoid: are you around? 
[16:07:55] (03CR) 10RobH: [C: 032] Redirect et.wikimedia.org to ee.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/239278 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [16:08:02] goddamn already needs rebase arghghhh [16:08:07] RECOVERY - puppet last run on analytics1026 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [16:08:08] (03PS5) 10RobH: Redirect et.wikimedia.org to ee.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/239278 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [16:08:10] looking at the other 3 [16:08:18] cool, i'll keep merging the apache redireciotn change [16:08:43] (03CR) 10BBlack: [C: 031] Change docs and integration.m.o to rewrite [puppet] - 10https://gerrit.wikimedia.org/r/229426 (https://phabricator.wikimedia.org/T84060) (owner: 10JanZerebecki) [16:09:09] (03CR) 10BBlack: [C: 031] Make link in dataset relative [puppet] - 10https://gerrit.wikimedia.org/r/239367 (https://phabricator.wikimedia.org/T112892) (owner: 10JanZerebecki) [16:10:42] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 5Patch-For-Review: Fix llama user id {hawk} [5 pts] - https://phabricator.wikimedia.org/T100678#1663123 (10kevinator) 5Open>3Resolved [16:11:10] (03CR) 10BBlack: [C: 031] Create real URIs for wikidata RDF URIs [puppet] - 10https://gerrit.wikimedia.org/r/230483 (https://phabricator.wikimedia.org/T97195) (owner: 10Smalyshev) [16:13:37] (03PS40) 10Ori.livneh: Basic role for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [16:13:56] paravoid, why is wikimedia.org MX being switched separately? [16:14:10] just so less can break at the same time? 
[16:14:32] (CR) Ori.livneh: [C: 2] Basic role for Sentry [puppet] - https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: Gilles) [16:14:47] !log re-enabling puppet on mw hosts, as the new patchset 239278 deployed and tested fine on a single host, deploying to rest [16:14:50] PS40 heh [16:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:15:02] is that a new record? [16:15:14] bblack: maybe for operations/puppet -- for mediawiki/core, not even close [16:15:33] bblack: so now we have to wonder how best to deploy the rest of the config changes [16:15:47] i can trigger puppet to run and get all the updates for the now working et to ee.wikimedia.org redirect [16:15:53] or we can merge the other ones and update all at once... [16:15:59] (i'm not sure what the best answer is) [16:16:01] https://gerrit.wikimedia.org/r/#/c/220665/ <-- PS50 [16:16:10] for a new redirect, you can just merge it and wait [16:16:19] mutante: i hate waiting but yea [16:16:25] except now i have 3 more redirection changes [16:16:28] I'd say for changes that aren't really "special" in some way: +2 merge, puppet-merge, run one host manually to verify, then let normal puppet timing do the rest [16:16:30] so do i wait for each one to fully deploy? [16:16:36] ok [16:16:43] i think that sounds sane as well but just discussing =] [16:16:59] ok, then one is done, next up [16:17:11] obviously if the change is horribly broken, we'd still get puppetfail spam that way by not pre-disabling, but pre-disabling should be rare for really complex/dangerous things, not puppetswat [16:17:12] bblack: did you want to deploy the next one?
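Note on the rollout order discussed above ("run one host manually to verify, then let normal puppet timing do the rest"): a minimal sketch of the canary-first idea in Python. `apply_change` and `verify` are hypothetical caller-supplied callables standing in for "run the agent on this host" and "check the result"; the host names are made up, and nothing here calls real Puppet tooling.

```python
def canary_rollout(hosts, apply_change, verify):
    """Apply a change to the first host, verify it, then continue with the rest.

    Returns the list of hosts the change was applied to. Raises RuntimeError
    if the canary fails verification, before any other host is touched.
    """
    if not hosts:
        return []
    canary, rest = hosts[0], hosts[1:]
    apply_change(canary)
    if not verify(canary):
        # Stop here so a broken change never reaches the remaining hosts.
        raise RuntimeError(f"canary {canary} failed verification")
    for host in rest:
        apply_change(host)
    return [canary, *rest]


# Usage with stand-in callables: record which hosts were touched.
applied = []
result = canary_rollout(["mw-a", "mw-b", "mw-c"], applied.append, lambda host: True)
print(result)
```

The trade-off raised in the channel still applies: verifying on one host first limits the blast radius, while pre-disabling the agent everywhere is a heavier step that the participants suggest reserving for complex or dangerous changes.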
[16:17:18] robh: go for it [16:17:18] https://wikitech.wikimedia.org/wiki/Application_servers shows the walk through [16:17:20] ok [16:17:28] I think this might be the record: https://gerrit.wikimedia.org/r/#/c/135312/ -- PS72 [16:17:39] (03PS2) 10RobH: Change docs and integration.m.o to rewrite [puppet] - 10https://gerrit.wikimedia.org/r/229426 (https://phabricator.wikimedia.org/T84060) (owner: 10JanZerebecki) [16:17:53] jzerebecki: workign on your patches now =] [16:18:38] (03CR) 10RobH: [C: 032] Change docs and integration.m.o to rewrite [puppet] - 10https://gerrit.wikimedia.org/r/229426 (https://phabricator.wikimedia.org/T84060) (owner: 10JanZerebecki) [16:19:22] ok, merged and running manually on one to watch and confirm its ok [16:20:06] im putting puppet back to disabled on all the rest though im too paranoid to let them sit and autoupdated until i have one working [16:20:12] (ive put, wrong tense) [16:20:57] ok :) [16:21:36] (03CR) 10Chad: [C: 032] Stop executing on failure [tools/scap] - 10https://gerrit.wikimedia.org/r/239521 (owner: 10Thcipriani) [16:21:44] works for me either way. but philosophically - puppetswat should be simple changes with low risk of breakage. if we're regularly running into problems with puppetfail or breakage due to lack of pre-disable, I'd rather it be painful so that we change our puppetswat risk threshold accordingly. [16:21:51] (03Merged) 10jenkins-bot: Stop executing on failure [tools/scap] - 10https://gerrit.wikimedia.org/r/239521 (owner: 10Thcipriani) [16:23:12] robh, did it apply everywhere? 
[16:23:14] !log re-enabled puppet on mw hosts, as both redirection changes are good [16:23:18] not yet no [16:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:29] see backlog, we had multiple apache chagnes [16:23:42] ok [16:23:43] so i just rolled them each onto a single test system to confirm working and now just enabled puppet back on for all of them [16:23:53] but, now they will all start to call in and get the update =] [16:23:58] (some may already have it) [16:24:16] sorry for confusion [16:25:05] 6operations, 10ops-eqiad: ps1-a5 -eqiad power not balanced - https://phabricator.wikimedia.org/T111973#1663179 (10Cmjohnson) 5Open>3Resolved load is balanced [16:25:55] jzerebecki: you about? I am trying to track down exactly what hosts these will change on your dataset dir patchset [16:26:00] so i ensure it doesnt break them [16:26:15] robh: mom. [16:26:16] bblack: did you happen to see where they ran? [16:26:31] nah, babysitter, not mom ;] [16:26:40] im babysitting the process and being paid to do so =] [16:27:09] robh: dataset1001.wikimedia.org and possibly the secondary [16:27:12] (03PS3) 10RobH: Make link in dataset relative [puppet] - 10https://gerrit.wikimedia.org/r/239367 (https://phabricator.wikimedia.org/T112892) (owner: 10JanZerebecki) [16:27:32] cool, i was assumign but was still mid manifest trackdown. 
saves me trouble =] [16:27:35] robh: which would be ms1001.wikimedia.org [16:28:07] basically dataset::primary and dataset:secondary [16:28:23] (03CR) 10RobH: [C: 032] Make link in dataset relative [puppet] - 10https://gerrit.wikimedia.org/r/239367 (https://phabricator.wikimedia.org/T112892) (owner: 10JanZerebecki) [16:29:07] running now on dataset1001 [16:29:40] or was it that these get rsynced to dataset, *sigh* dumps infrastructure is so confusing [16:30:17] bblack: I am getting light on the fiber [16:30:36] from asw-a2 xe-0/0/13 [16:30:37] yea wasnt a change on dataset1001, so i'll go back to tracking [16:30:50] though now its merged so it may run on said hosts before i get to it, but it was also +1 by a lot of folks [16:30:55] likely ok, but im still trackign it down [16:31:32] no that is how it works [16:31:37] cmjohnson1: the host has no link, and the switch says: [16:31:42] bblack@asw2-a5-eqiad> show interfaces xe-0/0/13 [16:31:42] error: device xe-0/0/13 not found [16:31:54] but show chassis hardware shows: [16:31:55] Xcvr 13 NON-JNPR PKC3UQ9 SFP-LX10 [16:32:11] jzerebecki: it is and i've avoided learning it ;P [16:32:16] (11, 12, and 13 all show that same SFP-LX10, and they're the ports here for these new LVS connections) [16:32:21] but... 
its merged and apergos totally did a +1 the other day [16:32:28] with the note of 'if this breaks anything, it'll be wikidata' [16:32:30] =] [16:32:38] role::dataset::primary has class { 'dataset': which has include dataset::html which has include dataset::dirs [16:33:08] Krenair: yes [16:33:10] so the change should have been done on dataset1001 [16:33:13] ok [16:33:48] robh: ls -la /data/xmldatadumps/public/wikidatawiki/ [16:34:13] its populated [16:34:15] robh: should contain a symlink named entities to ../other/wikibase/wikidatawiki [16:34:20] it does [16:34:26] ok then its done [16:34:27] so yep [16:34:28] entities -> /data/xmldatadumps/public/other/wikibase/wikidatawiki [16:34:30] cool [16:34:38] robh: no that is not relative [16:34:54] that was the bug this patch should fix [16:35:03] so why does the task it links to says it changing stuff on stat1002? [16:35:07] https://phabricator.wikimedia.org/T112892 [16:35:15] so shouldnt that link change on stat1002? [16:35:23] (03PS2) 10Faidon Liambotis: Switch MX to mx1001/mx2001 (non-wikimedia.org) [dns] - 10https://gerrit.wikimedia.org/r/240102 [16:35:32] (03CR) 10Faidon Liambotis: [C: 032] Switch MX to mx1001/mx2001 (non-wikimedia.org) [dns] - 10https://gerrit.wikimedia.org/r/240102 (owner: 10Faidon Liambotis) [16:35:45] robh: because stat1002 mounts dataset via nfs [16:35:50] ahh [16:35:51] ok [16:36:02] as the mount point is different the symlink needs to be relative [16:36:30] is the symlink is now relative or absolute? [16:36:36] absolute [16:36:57] was what i pasted =[ [16:37:01] (03PS1) 10Faidon Liambotis: Switch the wiki-mail-eqiad service IP to mx1001 [puppet] - 10https://gerrit.wikimedia.org/r/240121 [16:37:12] mh so why didn't the puppet run change anything... [16:37:20] running again for kicks... 
[16:38:02] jzerebecki: Notice: /Stage[main]/Dataset::Dirs/File[/data/xmldatadumps/public/wikidatawiki/entities]/target: target changed '/data/xmldatadumps/public/other/wikibase/wikidatawiki' to '../other/wikibase/wikidatawiki' [16:38:03] ha [16:38:05] second run... [16:38:08] (03CR) 10Faidon Liambotis: [C: 032] Switch the wiki-mail-eqiad service IP to mx1001 [puppet] - 10https://gerrit.wikimedia.org/r/240121 (owner: 10Faidon Liambotis) [16:38:18] 6operations, 10ops-eqiad, 10Traffic, 10netops, 5Patch-For-Review: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1663225 (10Cmjohnson) Verified, connection and verified light coming from asw2 [16:38:21] :) good [16:38:32] ok, well, that solves that, going to babysit it on ms1001 as well [16:39:30] 6operations, 10Wikimedia-Mailing-lists: Upgrade Mailman to version 3 - https://phabricator.wikimedia.org/T52864#1663229 (10Aklapper) [16:40:10] ok, i saw it update the directory on ms1001 as well, so that patchset is done [16:40:13] next and last patchset [16:41:04] (03PS5) 10RobH: Create real URIs for wikidata RDF URIs [puppet] - 10https://gerrit.wikimedia.org/r/230483 (https://phabricator.wikimedia.org/T97195) (owner: 10Smalyshev) [16:41:13] still going, robh? 
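The symlink fix discussed above (absolute target replaced by `../other/wikibase/wikidatawiki`) can be reproduced in miniature. The following is only a sketch: moving a directory tree stands in for an NFS client mounting the export at a different path, the `mnt/` prefix is a made-up mount point, and the link names `entities_abs` and `entities_rel` are hypothetical (the real link is called `entities`).

```python
import os
import tempfile


def symlinks_after_remount():
    """Return (absolute_link_resolves, relative_link_resolves) after the
    tree is moved to a different prefix, imitating a different mount point."""
    with tempfile.TemporaryDirectory() as tmp:
        # Layout as seen on the primary host (paths shortened under tmp).
        primary = os.path.join(tmp, "data", "xmldatadumps")
        target = os.path.join(primary, "public", "other", "wikibase", "wikidatawiki")
        link_dir = os.path.join(primary, "public", "wikidatawiki")
        os.makedirs(target)
        os.makedirs(link_dir)

        # Old behaviour: absolute target. New behaviour: relative target.
        os.symlink(target, os.path.join(link_dir, "entities_abs"))
        os.symlink("../other/wikibase/wikidatawiki",
                   os.path.join(link_dir, "entities_rel"))

        # Hypothetical client view: same tree, different prefix.
        client = os.path.join(tmp, "mnt", "data", "xmldatadumps")
        os.makedirs(os.path.dirname(client))
        os.rename(primary, client)

        client_link_dir = os.path.join(client, "public", "wikidatawiki")
        return (
            os.path.exists(os.path.join(client_link_dir, "entities_abs")),
            os.path.exists(os.path.join(client_link_dir, "entities_rel")),
        )


if __name__ == "__main__":
    abs_ok, rel_ok = symlinks_after_remount()
    print("absolute link resolves:", abs_ok)
    print("relative link resolves:", rel_ok)
```

The absolute link dangles once the prefix changes, while the relative one is resolved from the link's own directory and keeps working, which matches the reasoning given for stat1002.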
[16:41:19] on the last one on the page right now [16:42:15] (03CR) 10RobH: [C: 032] Create real URIs for wikidata RDF URIs [puppet] - 10https://gerrit.wikimedia.org/r/230483 (https://phabricator.wikimedia.org/T97195) (owner: 10Smalyshev) [16:44:24] 6operations, 5Patch-For-Review: integration.wikimedia.org redirect behavior is incorrect - https://phabricator.wikimedia.org/T84060#1663264 (10JanZerebecki) 5Open>3Resolved [16:46:03] ok, last change broke mw1224 [16:46:08] glad im paranoid and disabled puppet on the rest [16:46:18] (03CR) 10Faidon Liambotis: Manage llama user in puppet to work around package bug (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/240111 (https://phabricator.wikimedia.org/T100678) (owner: 10Ottomata) [16:46:22] ottomata: I was late to the party [16:46:32] robh: can you past? [16:46:35] +e [16:46:39] ottomata: (see above) especially the uid hardcoding is a bad idea [16:46:39] (03PS1) 10RobH: broke when applied to mw1224, reverting Revert "Create real URIs for wikidata RDF URIs" [puppet] - 10https://gerrit.wikimedia.org/r/240123 [16:46:56] apache2: Syntax error on line 79 of /etc/apache2/apache2.conf: Syntax error on line 95 of /etc/apache2/sites-enabled/03-main.conf: Could not open configuration file /etc/apache2/sites-enabled/wikidata-uris.incl: No such file or directory [16:47:10] still troubleshooting, but got the revert in if we dont figure it out in a swift fashion [16:47:16] 6operations, 10ops-eqiad, 10Traffic, 10netops, 5Patch-For-Review: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1663299 (10BBlack) re the asw2-a5:13 <-> lvs1012:eth1 problem, this is what I'm currently seeing on the switch (host has no link): ``` bblack@asw2-a5-eqiad> show... [16:47:22] its calling a file puppet should have installed on that run [16:47:24] rerunning [16:47:57] oh, its the sites-enabled symlink that doesnt locate [16:48:28] So how are the other symlinks generated? [16:48:48] heh, another puppet run fixes... 
[16:48:48] lame [16:49:25] so seems to function now, just testing a few calls for regular use against the test server [16:49:27] paravoid: the package installs it with /bin/bash [16:49:32] it might not need it. [16:49:48] is system => true enough to not need to hardcode the uid? [16:49:53] jzerebecki: darn no ordering to puppet run applies! so it just happened to fire the test before it installed the file. [16:50:18] outch so the modules/mediawiki stuff is missing dependencies [16:50:36] so, im going to manually fire them all a couple of times just out of paranoia [16:50:46] but, swat window is now over otherwise, all patches applied [16:50:55] !log all mw servers returned to puppet enabled, puppet swat window over [16:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:52:00] that way if i see the apaches fail for them during my runs they will simply fire again [16:53:41] bblack: see the above, just in case there is a hiccup we should both know whats up [16:53:46] since we are on swat for the patchset [16:54:09] summary: puppet can fire the apache test BEFORE it actually installs the sites-enabled sym-link [16:54:25] so then puppet has to rerun to fire test effectively [16:55:05] (03PS11) 10Chad: A context manager for managing nested loggers [tools/scap] - 10https://gerrit.wikimedia.org/r/239028 (owner: 1020after4) [16:55:07] im watching them batch thorugh in very small amounts (i broke down salt to the groups of 100 systems, and then firing them off via 25% batching) [16:55:08] (03PS1) 10Chad: Use context logger and stop passing one to sudo_check_call [tools/scap] - 10https://gerrit.wikimedia.org/r/240126 [16:55:22] (03CR) 10jenkins-bot: [V: 04-1] Use context logger and stop passing one to sudo_check_call [tools/scap] - 10https://gerrit.wikimedia.org/r/240126 (owner: 10Chad) [16:55:35] robh: ok [16:55:42] (03CR) 10Chad: "PS10 was just a rebase" [tools/scap] - 10https://gerrit.wikimedia.org/r/239028 (owner: 1020after4) 
[16:55:51] I'm now retroactively worried that i made the wrong call. [16:56:05] * robh is going to be worried about this for the next hour until he sees all mw systems have called in. [16:56:19] heh, scap came back [16:56:54] robh: exec { 'apache2_test_config_and_restart': before => Service['apache2'], but all the modules/mediawiki stuff only has before => Service['apache2']; so that way it can actually try to restart before restarting the service [16:56:55] PROBLEM - puppetmaster https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:57:06] (or maybe krrrit-wm1 just wasn't reloaded with the new config?) [16:57:18] eww [16:57:30] 6operations, 10ops-eqiad, 10Traffic, 10netops, 5Patch-For-Review: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1663342 (10BBlack) @faidon says this is because the interface exists as ge-0/0/13, because these SFPs in ports 11, 12, 13 are 1Gb, should be 10Gb [16:57:39] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Move the wiki of WMEE - https://phabricator.wikimedia.org/T31919#1663345 (10Krenair) 5Open>3Resolved [16:57:41] 6operations, 10Wikimedia-Site-Requests, 7I18n, 7Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1663346 (10Krenair) [16:58:09] (03PS2) 10Chad: Use context logger and stop passing one to sudo_check_call [tools/scap] - 10https://gerrit.wikimedia.org/r/240126 [17:00:37] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 1.414 second response time [17:00:48] fuck... 
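The ordering concern raised above can be pictured as a reachability question on the declared `before` edges: a resource is only guaranteed to be applied ahead of another if some chain of edges connects them. This is a toy model of that hypothesis, not Puppet's actual scheduler, and the shortened resource titles are illustrative. Note that later in this log _joe_ attributes the failure to a race on the puppetmaster rather than to missing dependencies.

```python
from collections import defaultdict

FILE = "File[wikidata-uris.incl]"
EXEC = "Exec[apache2_test_config_and_restart]"
SERVICE = "Service[apache2]"


def guaranteed_before(edges, first, second):
    """True if the (a, b) 'a before b' edges force `first` ahead of `second`."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
    stack, seen = [first], set()
    while stack:
        node = stack.pop()
        if node == second:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph[node])
    return False


# As described in the chat: both resources only declare before => Service.
declared = [(FILE, SERVICE), (EXEC, SERVICE)]

# The proposed (and later rejected) extra edge: File before Exec.
proposed = declared + [(FILE, EXEC)]

if __name__ == "__main__":
    print(guaranteed_before(declared, FILE, EXEC))
    print(guaranteed_before(proposed, FILE, EXEC))
```

With only the declared edges, nothing in this model forces the file to exist before the config test runs; the extra edge would.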
[17:01:08] so im not convinced i havent made a horrible mistake [17:01:16] i now think perhaps i should have reverted that last one, not pushed [17:01:31] as now i fear it may fire puppet failures across the mw* cluster and then have them all have to recall in to get the update [17:02:06] PROBLEM - Cassandra database on restbase2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [17:02:06] jzerebecki: what file calls that in the mediawiki module? [17:02:23] ^ is me [17:02:27] maybe we should apply that fix so mw systems that havent called in can get it... [17:04:03] robh: we could add before => Exec['apache2_test_config_and_restart'] to everything in ::mediawiki that currently has before => Service['apache2'] but the important part for this run is only in modules/mediawiki/manifests/web/sites.pp [17:04:50] robh: should I prepare a patch? [17:05:01] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Cassandra inter-node encryption (TLS) - https://phabricator.wikimedia.org/T108953#1663376 (10fgiunchedi) [17:05:04] 6operations, 10RESTBase, 10RESTBase-Cassandra: rename cassandra cluster - https://phabricator.wikimedia.org/T112257#1663370 (10fgiunchedi) 5Resolved>3Open p:5High>3Normal [17:05:04] if you dont i shall and i'll just be doing what you said =] [17:05:11] so feel free and i'll snag it [17:05:23] greg-g: what do you mean by 'scap came back' [17:05:34] i should have asked for this before i reenabled puppet on the rest, i wasnt being paranoid enough [17:05:56] so far we seem to be ok (site isnt complaining) but my stress level went from nothing to horrible =P [17:06:48] jzerebecki: i may not force the patchset in right this second if hte site isnt crashing, but i dont want this to happen again and i think it should ensure the same way in both modules [17:07:02] thank you for finding out they didnt [17:07:46] ok, i'm seeing the file show up now on multiple apache servers without 
crashing/manual intervention [17:07:48] so i think we're ok. [17:08:16] the puppetmaster is still overloaded. [17:08:35] https://www.youtube.com/watch?v=sPLEbAVjiLA :) [17:09:17] <_joe_> jzerebecki: don't do that [17:09:27] <_joe_> jzerebecki: that would be horrible [17:09:55] _joe_: shouldnt the mediawiki module call the test, not the service refresh frist? [17:10:03] <_joe_> no [17:10:16] <_joe_> refreshing the service does exactly what the test does [17:10:22] or is the issue i just had really a non issue? (it runs the test and fails the puppet run and service restart) [17:10:23] service refresh should include the test [17:10:26] <_joe_> and then if the test is ok it reloads the config [17:10:38] ok [17:10:42] <_joe_> jzerebecki: it does [17:10:53] <_joe_> it's in the reload action of the apache2 init script [17:10:56] RECOVERY - Host lvs1012 is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms [17:10:57] _joe_: so why is this horrible exec stuff in puppet dupplicating that functionality? [17:10:59] <_joe_> its reload vs restart [17:11:21] <_joe_> restart in the init.d stops, tests, starts [17:11:30] <_joe_> while we want to test first, and stop later [17:11:35] <_joe_> that's why we have that exec [17:11:36] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 15 data above and 9 below the confidence bounds [17:11:43] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for JUnikowski_WMF - https://phabricator.wikimedia.org/T113298#1663396 (10VBaranetsky) Hi Dzahn, This is Victoria Baranetsky, Jonathan's supervising attorney for this assignment. Jonathan has my approval to access the statistics. Thank you for... [17:12:00] <_joe_> also we DON'T want to restart apache [17:12:11] <_joe_> we want to do a graceful reload 99.9% of the times [17:12:29] _joe_: so why do we by these execs do the order of reload with a restart? 
[17:12:33] Error: /Stage[main]/Apache/Service[apache2]: Failed to call refresh: Could not restart Service[apache2]: Execution of '/usr/sbin/service apache2 reload' returned 1: [17:12:34] Error: /Stage[main]/Apache/Service[apache2]: Could not restart Service[apache2]: Execution of '/usr/sbin/service apache2 reload' returned 1: [17:12:34] So then I’m merely parsing this wrong. It did the test and the test failed, so then it didn’t reload apache? [17:13:03] robh: no that is reload failed [17:13:12] 6operations, 10Annual-Report, 5Patch-For-Review: create git/gerrit repo for annual report 2015 - https://phabricator.wikimedia.org/T112928#1663399 (10Dzahn) 5Open>3Resolved I'm calling this resolved. If we need to talk about +2 permissions we can do that separately. [17:13:23] <_joe_> robh: exactly [17:13:28] directly after, an apacheconfigtest failed as well [17:13:32] <_joe_> robh: the reload does the test for you [17:13:34] but i suppose apache was running the entire time with its old config [17:13:39] <_joe_> yes [17:13:41] ok [17:13:59] so then next puppet run the file was placed in already post test last run (likely) and then passes [17:14:00] <_joe_> jzerebecki: so, we don't want apache to go down for a wrong config [17:14:10] so what's up with https://gerrit.wikimedia.org/r/#/c/230483/ - is it not good? [17:14:16] the system is more robust than i gave t credit for [17:14:30] 6operations, 6Discovery, 7Elasticsearch: Ferm doesn't update @resolve hostnames on IP change - https://phabricator.wikimedia.org/T113380#1663408 (10chasemp) 3NEW a:3MoritzMuehlenhoff [17:14:30] * jzerebecki is confused now [17:14:40] <_joe_> jzerebecki: why confused? 
[17:14:48] <_joe_> let me explain again [17:14:59] <_joe_> apache sysvinit by default does: [17:15:15] <_joe_> reload: configtest => reload [17:15:17] restart = stop check start [17:15:23] <_joe_> restart: stop configtest start [17:15:28] yup [17:15:31] <_joe_> the latter would be disruptive for us [17:15:37] yes [17:15:42] <_joe_> if we need a restart and we fucked up something [17:15:59] <_joe_> so we first check the config once more, if it's ok we do the restart [17:16:17] <_joe_> it's redundant but implements the very important BSTS pattern [17:16:29] <_joe_> "Better Safe Than Sorry" [17:17:10] jzerebecki, robh - what's the story with https://gerrit.wikimedia.org/r/#/c/230483/ ? [17:17:24] SMalyshev: its merged and live now [17:17:28] and gave me some trouble =] [17:17:28] so what we declared in puppet is check => restart => service apache2 (which includes reload on notification); config => service apache2 [17:17:41] robh: it says reverted there in gerrit? [17:17:49] (03Abandoned) 10RobH: broke when applied to mw1224, reverting Revert "Create real URIs for wikidata RDF URIs" [puppet] - 10https://gerrit.wikimedia.org/r/240123 (owner: 10RobH) [17:18:01] i put in a revert because i thought i was going to need it, but i just dropped it [17:18:13] robh: ahh [17:18:17] <_joe_> robh: well, either you fix it or you revert it :) [17:18:20] just an order of operation during implement (puppet tested apache before it had placed the new sites-enabled symlink) [17:18:21] * SMalyshev was confused a bit [17:18:24] SMalyshev: teh revert wasn't merged [17:18:28] gerrit is confusing about that [17:18:33] _joe_: I was reverting it becuase i thought it was broken [17:18:37] 6operations, 10MediaWiki-ResourceLoader, 6Performance-Team, 10Traffic, 5Patch-For-Review: [Research] Investigate 30% load.php reqs increase since 2015-07-30 - https://phabricator.wikimedia.org/T113007#1663439 (10Catrope) This appears to be resolved now? The percentage of long cache headers in 304 respons... 
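The reload/restart distinction _joe_ describes above can be modelled in a few lines. This is a simplified sketch of the sequences as stated in the chat (reload: configtest then reload; init-script restart: stop, configtest, start; guarded restart: configtest first, then restart). The class and method names are invented for illustration and do not correspond to any real Apache or Puppet API.

```python
class ApacheModel:
    """Toy model of a running service with a pending config change."""

    def __init__(self, config_ok):
        self.config_ok = config_ok      # would the new config pass configtest?
        self.running = True
        self.active_config = "old"

    def configtest(self):
        return self.config_ok

    def reload(self):
        # reload: configtest => reload; a bad config leaves the service alone
        if not self.configtest():
            return False
        self.active_config = "new"
        return True

    def init_restart(self):
        # init-script restart: stop, configtest, start
        self.running = False
        if not self.configtest():
            return False                # service stays down
        self.active_config = "new"
        self.running = True
        return True

    def guarded_restart(self):
        # the extra exec: test first, only then go through the restart
        if not self.configtest():
            return False
        return self.init_restart()


if __name__ == "__main__":
    for action in ("reload", "init_restart", "guarded_restart"):
        svc = ApacheModel(config_ok=False)
        getattr(svc, action)()
        print(action, "running:", svc.running, "config:", svc.active_config)
```

With a broken config, only the unguarded restart takes the service down; reload and the guarded restart both leave it serving the old config, which is the "Better Safe Than Sorry" point.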
[17:18:41] but it turns out it just was doing its tests and warning [17:18:41] <_joe_> robh: can I see a paste of the error? [17:18:54] robh: so it's fine, right? [17:18:58] <_joe_> robh: nope, is the patch applied everywhere? [17:19:09] yes a paste of the full error with actual order would be helpful to alleviate my confusion [17:19:23] <_joe_> mine too [17:19:31] <_joe_> because I was sure to have understood what happened [17:19:38] <_joe_> and now I'm not so sure anymore [17:19:49] <_joe_> robh: where is this patch applied? [17:19:57] <_joe_> because it broke apache2 config I think [17:20:06] https://phabricator.wikimedia.org/P2075 [17:20:20] _joe_: its merged so its being applied on all of them [17:20:35] 6operations, 10Annual-Report, 5Patch-For-Review: create git/gerrit repo for annual report 2015 - https://phabricator.wikimedia.org/T112928#1663454 (10Stephmonette) Hi @dzahn! I'll talk to Liam today and confirm that he logged into Phabricator. [17:20:36] this is what ive been mildly freaking out about, as im not convinced im not breaking things [17:21:12] Could not open configuration file /etc/apache2/sites-enabled/wikidata-uris.incl <-- this looks like puppet run not finished or some dependency missing? [17:21:15] So, I merged https://gerrit.wikimedia.org/r/#/c/230483/ and halted puppet on ALL the mw systems. applied it just to mw1224 and it had that error [17:21:25] <_joe_> robh: uhm [17:21:41] <_joe_> file { '/etc/apache2/sites-enabled/wikidata-uris.incl': [17:21:41] <_joe_> ensure => present, [17:21:41] <_joe_> source => 'puppet:///modules/mediawiki/apache/sites/wikidata-uris.incl', [17:21:44] <_joe_> before => Service['apache2'], [17:21:46] if the answer is 'you seriously fucked up' thats what ive been saying for 45 minutes [17:21:46] <_joe_> } [17:21:52] <_joe_> so it's a damn puppet bug [17:21:57] yes [17:22:02] ohh [17:22:03] <_joe_> robh: a full paste of that puppet run please? 
[17:22:03] so is wikidata-uris.incl not there even after another puppet run? [17:22:17] <_joe_> mutante: I guess it is, now syntax is ok [17:22:31] _joe_: refresh [17:22:34] fixed with full run [17:22:47] mutante: it is after it [17:22:55] it just seems to fire the test before the file is placed is all [17:23:04] <_joe_> robh: full run as opposed to a tagged run? [17:23:04] my fear was that it was firing the test and reloading, and then failing apache [17:23:05] <_joe_> AH! [17:23:15] oh shit, was that it, i didnt tag my run? [17:23:21] yes. [17:23:23] ah, so there was not a full puppet run, but only the tagged pupppe? [17:23:26] puppet run [17:23:29] no, i did a run run [17:23:32] full run [17:23:40] <_joe_> uhm then it's a puppet bug [17:23:42] how does a full run error and not a tagged run [17:23:53] <_joe_> chasemp: discard that [17:23:57] 6operations, 10RESTBase, 10RESTBase-Cassandra: rename cassandra cluster - https://phabricator.wikimedia.org/T112257#1663467 (10fgiunchedi) reopening, the cassandra configuration template uses `cluster_name: %{::site}` which is an unfortunate choice because it means cassandra clusters in different sites won't... [17:24:05] i break things. [17:24:06] <_joe_> chasemp: the point is puppet didn't install a file, while it should've [17:24:19] chasemp: who said a tagged run didnt have an error? [17:24:33] <_joe_> robh: can I have a full puppet run paste please? [17:24:37] i did! [17:24:40] its in the pastebin [17:24:41] reload [17:24:42] <_joe_> something tells me we're missing part of the problem [17:24:45] <_joe_> oh ok [17:24:50] I'm going silent so I don't confuse the convo here :) just following along [17:25:05] https://phabricator.wikimedia.org/P2075 has a full paste of the run [17:25:07] it was not tagged. 
[17:25:21] (so whoever brought that up can discard it) [17:25:37] <_joe_> ok I have just one explanation [17:25:54] <_joe_> for some reason you ran this before the secondary puppetmaster caught the patch [17:26:04] <_joe_> maybe puppet merge gave an error and failed? [17:26:13] lemme check my backlog and see if i have it [17:26:49] 6operations, 10MediaWiki-ResourceLoader, 6Performance-Team, 10Traffic, 5Patch-For-Review: [Research] Investigate 30% load.php reqs increase since 2015-07-30 - https://phabricator.wikimedia.org/T113007#1663486 (10Krinkle) 5Open>3Resolved a:3Krinkle Yep. Request total of load.php, long-expiry 304 res... [17:27:01] <_joe_> did you do other merges after that one? [17:27:10] <_joe_> because that would've fixed it [17:27:10] _joe_: nope, ran fine and it was solo, lemme make another pastebin [17:27:15] i guess this soudlnt be public? dunno [17:27:17] <_joe_> robh: no need [17:27:21] but no merge errors [17:27:34] Connection to strontium.eqiad.wmnet closed after full update [17:27:50] <_joe_> robh: anyways, for some reason puppet computed the change on the file but not the change to the manifest on the first run [17:28:00] the main thing is 'rob didnt crash the site, if only cuz he is lucky' [17:28:13] i was pretty sure i had crashed the site into a slow burn decent. [17:28:33] <_joe_> robh: thank ori and yours truly :) [17:28:47] <_joe_> we did make the whole process as safe as possible [17:29:16] indeed, much thanks [17:29:23] redundant testing is awesome [17:30:00] _joe_: im not sure if I can see an actual take-away change to behavior or implementation on this though [17:30:22] <_joe_> ? [17:30:24] though perhaps we should re-create the issue in another apache change (of a similar slant) to see if it'll keep happening? [17:30:40] this was not expected behavior... so there is something to fix or change [17:30:41] <_joe_> no it won't [17:30:56] just puppet fluke then? 
[17:30:56] <_joe_> it's not part of the puppet manifests that you want to fix [17:31:04] <_joe_> we ran in a classical race condition [17:31:20] <_joe_> where the puppet code wasn't updated yet, but the file was [17:31:23] 6operations, 6Discovery, 7Elasticsearch: Ferm doesn't update @resolve hostnames on IP change - https://phabricator.wikimedia.org/T113380#1663520 (10chasemp) [17:31:33] <_joe_> happens from time to time, in less scary ways [17:31:58] im perhaps overly touchy about errors in apache because its how ive crashed the site in the past =P [17:32:20] but my day went to shit in the span of 10 minutes, heh. glad its not as bad as i thought. [17:32:44] SMalyshev: your take away is your patch was good, damn puppet! thanks for submitting it =] [17:33:04] robh: thanks! :) [17:33:28] _joe_: thank you for the explanation, it is appreciated. [17:33:40] 6operations, 10ops-eqiad, 5Patch-For-Review: Decommission mw1031 - https://phabricator.wikimedia.org/T113079#1663532 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson this portion of the ticket is done...created a wipe ticket for on-site queue [17:34:23] _joe_: so even though it is done in one commit puppet master may send out a changed template before a changed .pp? [17:34:41] <_joe_> atemplate, no [17:34:53] <_joe_> a file, yes [17:35:06] outch [17:35:47] is there a way to tell puppetmaster wait, i'm changing your config. then now go on i'm done changing` [17:35:55] ? [17:36:39] 6operations, 10Annual-Report, 5Patch-For-Review: create git/gerrit repo for annual report 2015 - https://phabricator.wikimedia.org/T112928#1663560 (10Dzahn) Hi @Stephomonette sounds good:) Let me know if you have any questions about Phabricator or Gerrit. 
[17:38:08] 6operations, 10ops-eqiad: label server nobelium / wmf4543 / update racktables - https://phabricator.wikimedia.org/T113281#1663576 (10Cmjohnson) added label and updated racktables [17:38:17] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [17:38:29] 6operations, 10ops-eqiad: label server nobelium / wmf4543 / update racktables - https://phabricator.wikimedia.org/T113281#1663581 (10Cmjohnson) 5Open>3Resolved [17:42:02] 6operations, 10RESTBase, 10RESTBase-Cassandra: rename cassandra cluster - https://phabricator.wikimedia.org/T112257#1663602 (10Eevans) >>! In T112257#1663467, @fgiunchedi wrote: > reopening, the cassandra configuration template uses `cluster_name: %{::site}` which is an unfortunate choice because it means ca... [17:47:57] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:48:16] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 16 data above and 9 below the confidence bounds [17:55:01] (03PS1) 10Ottomata: Don't hardcode llama uid [puppet] - 10https://gerrit.wikimedia.org/r/240136 (https://phabricator.wikimedia.org/T100678) [17:55:35] greg-g: Can I have a deployment window for Flow at 3pm? [17:59:13] (03CR) 10Ottomata: [C: 032] Don't hardcode llama uid [puppet] - 10https://gerrit.wikimedia.org/r/240136 (https://phabricator.wikimedia.org/T100678) (owner: 10Ottomata) [18:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150922T1800). Please do the needful. [18:02:10] 6operations, 6Discovery, 7Elasticsearch: Ferm doesn't update @resolve hostnames on IP change - https://phabricator.wikimedia.org/T113380#1663683 (10chasemp) The temp solution is: sed -i '/CACHE=yes/c\CACHE=no' /etc/default/ferm && ferm --slow /etc/ferm/ferm.conf && puppet agent --test Set cache to no... 
[18:03:35] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [18:03:47] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 8 below the confidence bounds [18:06:00] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Chedasaurus - https://phabricator.wikimedia.org/T113302#1663686 (10Reinrosemary) Edward Galvez should be granted access to Hive for portal metrics. [18:06:28] (03PS1) 10Faidon Liambotis: Install mail.wikimedia.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/240139 [18:06:49] (03CR) 10Faidon Liambotis: [C: 032] Install mail.wikimedia.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/240139 (owner: 10Faidon Liambotis) [18:06:51] train is a little behind schedule today [18:07:16] twentyafterfour: I'm going to report this to my local train station :( [18:07:45] RECOVERY - Disk space on labstore1002 is OK: DISK OK [18:08:25] twentyafterfour: are you in sf? [18:08:34] if this was Japan we'd all get a paper to show our boss :) [18:11:25] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 5 below the confidence bounds [18:11:32] (03PS1) 10Faidon Liambotis: exim: fix exim4::dkim's content parameter [puppet] - 10https://gerrit.wikimedia.org/r/240141 (https://phabricator.wikimedia.org/T113051) [18:11:54] (03CR) 10Faidon Liambotis: [C: 032 V: 032] exim: fix exim4::dkim's content parameter [puppet] - 10https://gerrit.wikimedia.org/r/240141 (https://phabricator.wikimedia.org/T113051) (owner: 10Faidon Liambotis) [18:12:08] paravoid: :) duh [18:12:33] palladium is having a bit of a heartattack atm cpu wise [18:13:27] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: exim4::dkim creates empty key file - https://phabricator.wikimedia.org/T113051#1663711 (10faidon) 5Open>3Resolved a:3faidon Confirmed it fixes the issue. 
[18:13:27] mutante: duh indeed :) [18:13:28] robh: the temptation to ask for sodium to be allocated to the mailman project is high :P [18:13:44] robh: you know, legacy and so ;) [18:13:55] chasemp: i pushed palladium to the edge with my puppet swat an hour ago [18:14:03] i imagine that it'll have repercussions, fyi [18:15:49] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Chedasaurus - https://phabricator.wikimedia.org/T113302#1663719 (10egalvezwmf) @Dzahn - already signed. Thanks! [18:16:05] 6operations, 10ops-codfw: wipe working spare disk in codfw - https://phabricator.wikimedia.org/T112783#1663720 (10Papaul) I found a 2.5 SATA 250 Gb drive, this is the smallest drive I have on site. on the server, the RAID controller card doesn't allow me to create a RAID 0 or 1 with a single drive the minimum... [18:16:46] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:18:57] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [18:20:54] chasemp: no I'm not in SF [18:28:44] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Chedasaurus - https://phabricator.wikimedia.org/T113302#1663761 (10Dzahn) a:5Reinrosemary>3coren Thanks all, handing over to Coren as the "on duty" guy this week. [18:28:45] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds [18:29:08] Hm. I should take an hour and do all of those. 
[18:31:12] (03CR) 10Dzahn: "fixing that: https://gerrit.wikimedia.org/r/#/c/240034/" [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man) [18:31:35] (03CR) 10Dzahn: "after https://gerrit.wikimedia.org/r/#/c/240083/ this can also go" [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man) [18:33:24] Coren: yes please, the requests for stat1002 are never ending. more and more people seem to need those numbers for some reason [18:33:50] They be tasty numbers. :-) [18:34:21] Ima sit down and do those in batch once I'm done with what I'm in atm. [18:35:17] 6operations, 7Mail: Replace primary mail relays (polonium/lead) - https://phabricator.wikimedia.org/T113211#1663804 (10faidon) Google Apps was updated by OIT. MXes for all domains except wikimedia.org and its subdomains have been switched. wiki-mail-eqiad was switched as well. wikimedia.org and subdomains wil... [18:36:30] Coren: cool, thanks [18:38:12] (03PS10) 10Andrew Bogott: toolschecker: Add tests for starting/stopping web services [puppet] - 10https://gerrit.wikimedia.org/r/239504 [18:38:27] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [18:40:55] (03CR) 10Andrew Bogott: [C: 032] toolschecker: Add tests for starting/stopping web services [puppet] - 10https://gerrit.wikimedia.org/r/239504 (owner: 10Andrew Bogott) [18:42:25] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [18:57:21] (03PS1) 10Ori.livneh: [WIP] Simplify sentry module [puppet] - 10https://gerrit.wikimedia.org/r/240150 [18:59:17] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0] [19:02:51] (03Restored) 10Chad: Simplify logging in ssh module [tools/scap] - 
10https://gerrit.wikimedia.org/r/238959 (owner: 10Chad) [19:03:00] (03PS6) 10Chad: Simplify logging in ssh module [tools/scap] - 10https://gerrit.wikimedia.org/r/238959 [19:03:17] (03CR) 10jenkins-bot: [V: 04-1] Simplify logging in ssh module [tools/scap] - 10https://gerrit.wikimedia.org/r/238959 (owner: 10Chad) [19:03:35] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [19:05:55] (03PS2) 10Dzahn: mira: remove inclusion of releases::upload [puppet] - 10https://gerrit.wikimedia.org/r/240034 [19:07:07] (03CR) 10Dzahn: [C: 032] "noop because it is in the applied role class too" [puppet] - 10https://gerrit.wikimedia.org/r/240034 (owner: 10Dzahn) [19:09:38] (03CR) 10Dzahn: "after this is merged i would like to get https://gerrit.wikimedia.org/r/#/c/223458/ merged too, the diff was the firewall class and we wan" [puppet] - 10https://gerrit.wikimedia.org/r/240083 (owner: 10Muehlenhoff) [19:10:57] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:12:43] twentyafterfour: is wmf24 still getting rolled out to group0 today? [19:13:13] (03CR) 10John F. Lewis: [C: 031] "The script seems to follow the rename procedure well and I can't find any fault in the script itself." 
[puppet] - 10https://gerrit.wikimedia.org/r/240024 (owner: 10Dzahn) [19:13:32] (03PS7) 10Dzahn: Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man) [19:14:32] (03CR) 10jenkins-bot: [V: 04-1] Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man) [19:15:05] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [19:16:06] (03PS7) 10Chad: Simplify logging in ssh module [tools/scap] - 10https://gerrit.wikimedia.org/r/238959 [19:18:23] (03PS8) 10Dzahn: Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man) [19:22:04] ostriches: do you know when wmf24 is rolling out? is it still happening today? [19:22:50] Should be? ping twentyafterfour... [19:23:05] I did, above [19:23:17] (03CR) 10Dzahn: "to be merged after https://gerrit.wikimedia.org/r/#/c/240083/ and another rebase" [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man) [19:23:58] ori: I was pinging him :p [19:24:09] Oh :) [19:24:42] it was a https://en.wikipedia.org/wiki/Speech_act [19:27:07] (03PS1) 10Andrew Bogott: Toolschecker: fix a few typos [puppet] - 10https://gerrit.wikimedia.org/r/240158 [19:27:51] 6operations, 10ops-ulsfo: Properly patch Telia @ ulsfo - https://phabricator.wikimedia.org/T112152#1664045 (10RobH) I'm currently onsite to document the steps needed for correcting this issue. The patch currently routes from the fiber intake duct in the top of the rack (with all the other fiber cross-connects... 
[19:29:02] (03CR) 10Andrew Bogott: [C: 032] Toolschecker: fix a few typos [puppet] - 10https://gerrit.wikimedia.org/r/240158 (owner: 10Andrew Bogott) [19:35:13] (03PS1) 10Andrew Bogott: toolschecker: correct the location of 'webservice' [puppet] - 10https://gerrit.wikimedia.org/r/240159 [19:35:45] 6operations, 10RESTBase, 6Services: RESTBase and domain renames - https://phabricator.wikimedia.org/T113307#1664054 (10Eevans) @mobrovac said: > I personally feel option 3 is the way to go. I agree, with one But. > ...looks for data associated with a.wp.org and updates it to b.wp,org if such data is found.... [19:36:42] (03CR) 10Andrew Bogott: [C: 032] toolschecker: correct the location of 'webservice' [puppet] - 10https://gerrit.wikimedia.org/r/240159 (owner: 10Andrew Bogott) [19:37:29] 6operations, 10ops-ulsfo: order more 1m lc/sc patches for ulsfo - https://phabricator.wikimedia.org/T113401#1664055 (10RobH) 3NEW a:3RobH [19:38:34] can someone help me flush the negative address cache on polonium? [19:38:55] trying to troubleshoot a new address creation, which is in ldap, but polonium is blocking.. [19:39:15] heya hashar, yt? [19:39:20] Coren: know anything about ^^^ process? [19:40:31] ori: yes it's just late [19:40:33] 6operations, 10ops-ulsfo: Move NTT @ ulsfo to a different cross-connect - https://phabricator.wikimedia.org/T112154#1664075 (10RobH) I'm currently onsite to document the steps needed to correct this. Faidon has looped me in with the LoA via direct email with the vendor. This transit connection currently att... [19:40:51] ottomata: hello :) [19:41:54] hiya [19:42:00] https://integration.wikimedia.org/ci/job/tox-py27/3194/console [19:42:11] i'm doing a similar thing to what _joe_ does for conftool [19:42:22] but, this test requires that etcd is installed [19:42:41] the test starts up an etcd server process and then uses it [19:42:49] ottomata: yeah and to solve that we use a Debian Jessie system [19:42:54] k [19:43:05] howso? 
[19:43:16] ottomata: so I think your repo definition in zuul layout needs to change tox-py27 by tox-py27-jessie [19:43:50] ottomata: that is for mediawiki/extensions/EventLogging right ? [19:43:54] yes [19:44:21] ottomata: clone integration/config.git then in /zuul/layout.yaml look for EventLogging [19:44:44] the tox-py27 jobs currently runs on Precise [19:45:16] OO ok [19:45:22] ottomata: in the same file if you look up for joe conftool, you will see his repo uses tox-py27-jessie [19:45:25] we recently moved it to trusty, so this is a good change, is jessie necessary? [19:45:29] probably to install etcd, eh? [19:45:45] yeah jessie slaves are the only ones having etcd afaik [19:46:08] hm, ok, i think eventlogging works on jessie now, so it shoudl be ok [19:46:09] I will probably migrate all the tox jobs to Jessie soonish [19:47:54] 6operations, 10ops-codfw: cp4005 / cp4014: Description: The system board PS2 PG Fail voltage is outside of range. on cp4005 and cp4014. - https://phabricator.wikimedia.org/T113403#1664088 (10RobH) 3NEW a:3RobH [19:48:47] hashar: https://gerrit.wikimedia.org/r/#/c/240161/1/zuul/layout.yaml ? [19:49:14] ottomata: yeah looks legit :-} [19:49:42] 6operations, 10ops-codfw: cp4005 / cp4014: Description: The system board PS2 PG Fail voltage is outside of range. on cp4005 and cp4014. - https://phabricator.wikimedia.org/T113403#1664105 (10RobH) 5Open>3Resolved cp4005: ------------------------------------------------------------------------------- Recor... [19:50:29] ottomata: deployed. 
You can now comment 'recheck' on Gerrit change https://gerrit.wikimedia.org/r/#/c/238854/ [19:50:35] ottomata: will retriever the tests [19:50:55] err retry [19:53:45] ottomata: https://integration.wikimedia.org/ci/job/tox-py27-jessie/36/console *whistles* [19:54:18] sorry hashar, internet probs [19:54:56] ottomata: https://integration.wikimedia.org/ci/job/tox-py27-jessie/36/console *whistles* [19:55:14] ottomata: so it is now running on Jessie, still fails but for different reason [19:55:42] ! [19:55:43] hm [19:56:06] different version of etcd client? naww thats right. [19:56:08] hm. [19:57:03] hashar: i'm not sure how to figure that one out without investigating the environment [19:57:49] (03CR) 10Dzahn: [C: 04-1] "i wanted to confirm all these things on mira, because mira has the same role and already base::firewall, but found it to be broken on mira" [puppet] - 10https://gerrit.wikimedia.org/r/240083 (owner: 10Muehlenhoff) [20:02:47] wmf/1.26wmf24 is almost ready ... [20:03:26] ottomata: the slave has etcd 2.0.10-1 if it matters [20:03:45] ottomata: and apparently there is no etcd.Client but a lower case one: etcd.client [20:04:14] ottomata: aka: etcd.client.Client() [20:06:37] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for JUnikowski_WMF - https://phabricator.wikimedia.org/T113298#1664172 (10Dzahn) Hi @VBaranetsky thanks for the approval. Coren will follow-up with this ticket to get the access for Jonathan. 
[20:06:54] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for JUnikowski_WMF - https://phabricator.wikimedia.org/T113298#1664173 (10Dzahn) a:3coren [20:07:31] ottomata: might want to poke _joe_ about it tomorrow [20:07:36] I am heading out myself (sleep time) [20:07:57] (03PS1) 10Ottomata: Increase max_allowed_packet and set temporarily read_only = 1 for analytics mysql meta [puppet] - 10https://gerrit.wikimedia.org/r/240171 [20:08:10] ah, i think that does matter [20:08:16] hm [20:08:40] hashar: the setup.py says >= 0.4.0, and I think that is what is available via .deb [20:08:57] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [20:09:05] maybe if I make it say == 0.4.0? [20:09:40] ohh [20:09:41] hashar [20:09:53] 2.0.10?? [20:09:56] for python etcd? [20:09:58] hm. [20:10:03] that's not a version i'm aware of [20:10:40] hm, ok, hm. wonder why this works for me.. [20:10:41] weird. [20:12:51] (03CR) 10Ottomata: [C: 032] Increase max_allowed_packet and set temporarily read_only = 1 for analytics mysql meta [puppet] - 10https://gerrit.wikimedia.org/r/240171 (owner: 10Ottomata) [20:14:20] ottomata: virtualenv should not use the .deb version . And apparently it installed etcd 2.0.8 [20:14:33] (03CR) 10Zfilipin: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/220489 (owner: 10Andrew Bogott) [20:15:24] hashar, amd trying etc.client.Client, that is a valid Class in 0.4.0 too. [20:15:31] its just exported by __init__.py in 0.4.0 [20:15:41] 2.0.8 must be in a different source repo than i'm looking at [20:16:10] bah! [20:16:16] AttributeError: 'module' object has no attribute 'client' [20:16:36] :( [20:16:57] (03PS3) 10Andrew Bogott: Remove some uses of scope.lookupvar by passing args more explicitly. 
[puppet] - 10https://gerrit.wikimedia.org/r/220489 [20:17:28] hm hashar [20:17:35] i think whatever virtual env is installing is a different lib [20:18:08] i think its installing this [20:18:08] https://github.com/dsoprea/PythonEtcdClient [20:18:15] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [20:18:29] not this: https://github.com/jplana/python-etcd [20:18:47] OHHHHH [20:18:49] name='python-etcd' [20:18:53] got it! :) [20:18:53] ahahha [20:18:59] I feel very sorry [20:19:06] I had the exact same issue with statsd iirc [20:19:51] ottomata: kudos on figuring out the name madness [20:20:02] thanks for your help too [20:20:05] hopefully this will fix. [20:20:05] :) [20:20:37] (03CR) 10Andrew Bogott: "Verified that this is a no-op on the beta puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/220489 (owner: 10Andrew Bogott) [20:20:44] don't worry, I once spent a good one hour because : is not ; [20:21:25] (03PS1) 1020after4: 1.26wmf24 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240176 [20:21:27] (03CR) 10Andrew Bogott: [C: 032] Remove some uses of scope.lookupvar by passing args more explicitly. [puppet] - 10https://gerrit.wikimedia.org/r/220489 (owner: 10Andrew Bogott) [20:21:48] (03CR) 1020after4: [C: 032] 1.26wmf24 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240176 (owner: 1020after4) [20:21:50] (03CR) 10MarcoAurelio: "Are you sure, Glaisher? This should work for wikis where EP is installed. 
If the wiki does not have the EP-ext, it will not change anythin" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240049 (https://phabricator.wikimedia.org/T112806) (owner: 10MarcoAurelio) [20:21:54] (03Merged) 10jenkins-bot: 1.26wmf24 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240176 (owner: 1020after4) [20:22:50] !log twentyafterfour@tin Started scap: Test 1.26wmf24 [20:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:23:58] ahh crap hashar that fixed [20:24:17] but now there are issues with the mediawiki extension tests? [20:24:22] these should probably be separate repos...:/ [20:24:54] (03PS5) 10Andrew Bogott: Add an additional puppet config to use with minimal runs. [puppet] - 10https://gerrit.wikimedia.org/r/221563 [20:26:20] (03CR) 10Dduvall: Remove some uses of scope.lookupvar by passing args more explicitly. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/220489 (owner: 10Andrew Bogott) [20:26:32] 6operations: check mailman queue size monitoring - https://phabricator.wikimedia.org/T113326#1664231 (10JohnLewis) Looking into this, the check is running with insufficient privileges to be able to check the directories it is. It is set to error a 42 yet seemingly has never exceed that despite the server saying... [20:26:43] (03CR) 10Alex Monk: "Also this doesn't cover ruwiki." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240049 (https://phabricator.wikimedia.org/T112806) (owner: 10MarcoAurelio) [20:27:34] (03CR) 10Alex Monk: "Yes, let's do it inside a wmgUseEducationProgram check please." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240049 (https://phabricator.wikimedia.org/T112806) (owner: 10MarcoAurelio) [20:27:51] ottomata: kudos :-) [20:27:56] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:28:27] hashar: yeah but it broke the mediawiki part [20:28:29] so mehHH? [20:29:30] ottomata: seems unrelated. 
Maybe something has been broken in mediawiki/core [20:29:49] it didn't break until I changed the tox thing though [20:29:57] you can see on previous patchsets that those two passed [20:30:23] 6operations: check mailman queue size monitoring - https://phabricator.wikimedia.org/T113326#1664234 (10JohnLewis) Furthermore we shouldn't check the shunt directory. in, out, virgin and bounces should be the directories we check. [20:30:56] ottomata: seems like a recent breakage in one of the multiple repos :-D [20:33:22] sleep time *wave* [20:34:26] hmm, ok [20:34:29] thanks for your help! [20:37:26] PROBLEM - Disk space on helium is CRITICAL: DISK CRITICAL - free space: /srv/baculasd2 17630 MB (3% inode=99%) [20:37:55] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [20:39:47] (03PS1) 10Andrew Bogott: toolschecker: Give services 10 seconds to stop as well as start. [puppet] - 10https://gerrit.wikimedia.org/r/240258 [20:39:48] oh, helium.. no [20:40:35] damn it, that's because we are backup up the pmtpa home dirs.. i did check the disk space though and it looked enough [20:40:45] backing up [20:41:42] (03CR) 10Andrew Bogott: [C: 032] toolschecker: Give services 10 seconds to stop as well as start. [puppet] - 10https://gerrit.wikimedia.org/r/240258 (owner: 10Andrew Bogott) [20:42:24] greg-g: [10:55] RoanKattouw greg-g: Can I have a deployment window for Flow at 3pm? 
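Side note on the etcd mix-up worked out above (20:14 to 20:19): the test failed because the virtualenv had pulled in a distribution named `etcd` (apparently PythonEtcdClient) rather than `python-etcd`, and both install a top-level module called `etcd`. A minimal sketch, assuming Python 3.10+ for `importlib.metadata.packages_distributions()`, of how one might check which distribution actually provides an import name. The helper names `providers_of` and `check_provider` and the `mapping` parameter are hypothetical and appear nowhere in the log.

```python
from importlib import metadata


def providers_of(import_name, mapping=None):
    """Return the sorted distribution names that ship the top-level module `import_name`.

    `mapping` is an optional {import name: [distribution names]} dict, accepted
    here so the logic can be exercised without touching the real environment.
    When it is omitted, the installed environment is inspected instead.
    """
    if mapping is None:
        # packages_distributions() exists in Python 3.10 and later.
        mapping = metadata.packages_distributions()
    return sorted(set(mapping.get(import_name, [])))


def check_provider(import_name, expected, mapping=None):
    """Return True only if `expected` is the sole distribution providing `import_name`."""
    return providers_of(import_name, mapping) == [expected]
```

In a real virtualenv, `providers_of("etcd")` would list whichever distribution owns the `etcd` module, so a result other than `["python-etcd"]` would point at the kind of name clash seen here. The fix described in the chat was on the requirement side (`name='python-etcd'`); this check is only a diagnostic.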
[20:44:33] !log cancel backup job of bast1001 on helium because running low on disk [20:44:34] hey guys, I've a question about your Cassandra cluster [20:44:36] PROBLEM - HHVM rendering on mw2187 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 689 bytes in 0.095 second response time [20:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:44:48] (03PS1) 10Alex Monk: Copy default $wgEchoDefaultNotificationTypes['emailuser'] into wmf-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240259 (https://phabricator.wikimedia.org/T113367) [20:45:05] PROBLEM - Apache HTTP on mw2187 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 689 bytes in 0.087 second response time [20:45:11] in a multi-dc cluster with GossipingPropertyFileSnitch, if you run a nodetool info, do you see the correct DC and RACK? [20:46:32] I'm checking if I hit a bug or it's somehow related to my configuration [20:47:34] That mw2187 alert is due to "MWException from line 469 of /srv/mediawiki/php-1.26wmf23/includes/cache/LocalisationCache.php: No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php." [20:47:59] is that a deploy problem? [20:48:00] (03CR) 10Andrew Bogott: "thanks, Daniel!" [puppet] - 10https://gerrit.wikimedia.org/r/220489 (owner: 10Andrew Bogott) [20:48:56] 6operations, 5Patch-For-Review: Backup and decom home_pmtpa - https://phabricator.wikimedia.org/T113265#1664322 (10Dzahn) I merged the Gerrit change and Bacula started creating the backup. I had glanced at the disk space on helium before doing that and it seemed enough but nevertheless this happened: < icinga... 
[20:49:02] (03CR) 10Andrew Bogott: [C: 031] maintain-replicas: Do not record centralauth in meta_p.wiki [software] - 10https://gerrit.wikimedia.org/r/221042 (https://phabricator.wikimedia.org/T101750) (owner: 10Alex Monk) [20:49:15] 6operations, 10ops-eqiad: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1664323 (10Cmjohnson) Moved elastic1030 to row A3 Moved elastic1005 to row D4 Racktables has been updated. [20:49:24] chasemp, maybe? I wonder if sync-common might fix it [20:49:37] bit weird that it's host-specific [20:52:38] yes, sync-common should fix it [20:57:40] (03PS1) 10Rush: elastic: update rack location for 1005 and 1030 [puppet] - 10https://gerrit.wikimedia.org/r/240265 [20:58:22] 6operations, 10RESTBase, 10RESTBase-Cassandra: rename cassandra cluster - https://phabricator.wikimedia.org/T112257#1664371 (10mobrovac) >>! In T112257#1663602, @Eevans wrote: > I prefer the latter: I tend to agree. Entering risky procedures at this point would be plain foolish. > Having this string be 'eq... [21:00:16] (03PS2) 10Rush: elastic: update rack location for 1005 and 1030 [puppet] - 10https://gerrit.wikimedia.org/r/240265 [21:03:00] (03CR) 10MarcoAurelio: "Okay, would:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240049 (https://phabricator.wikimedia.org/T112806) (owner: 10MarcoAurelio) [21:03:24] Krenair: ^ [21:04:06] (03CR) 10Alex Monk: "No" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240049 (https://phabricator.wikimedia.org/T112806) (owner: 10MarcoAurelio) [21:04:26] lol [21:05:02] mafk, NS_EP does not exist [21:05:08] 'default' does not make sense [21:05:50] You want to $wgNamespaceProtection[EP_NS] = array( 'ep-course' ); [21:06:25] twentyafterfour: hows the train deploy going? [21:07:34] ebernhardson: so we have the machine now. Now what? 
:) [21:08:07] I guess we apply an elastic role that makes it have an elastic cluster of 1node [21:08:12] And open appropriate ports [21:08:13] 6operations, 10ops-ulsfo: Move NTT @ ulsfo to a different cross-connect - https://phabricator.wikimedia.org/T112154#1664406 (10RobH) emailed support with the following: In working on our cross-connections, I noticed we have a link that isn't properly terminated in our patch panel, but simply run directly to o... [21:08:16] Krenair: I was thinking on a $wmfConfigDir/educationprogram.php and adding $wgNamespaceProtection[EP_NS] = array( 'ep-course' ); on it, but if it's just adding $wgNamespaceProtection[EP_NS] = array( 'ep-course' ); on CommonSettings.php that'd be fine [21:08:42] 6operations, 10ops-ulsfo: Properly patch Telia @ ulsfo - https://phabricator.wikimedia.org/T112152#1664407 (10RobH) emailed support with the following: In working on our cross-connections, I noticed we have a link that isn't properly terminated in our patch panel, but simply run directly to one of our routers... [21:11:12] (03PS1) 10Catrope: Enable Flow opt-in on testwiki for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240270 [21:12:27] yuvipanda: yes, elasticsearch instance w/ single node. there is a hieradata variable you have to set to 'name' the cluster [21:12:56] 6operations, 10ops-eqiad, 10Traffic, 10netops, 5Patch-For-Review: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1664414 (10BBlack) The issue with the asw2-a5 SFPs was fixed. We've still got the blocking issue with Row D SNMP, for which we have the new LVS ports disabled (... [21:13:05] csteipp: ping [21:13:09] yuvipanda: then nginx proxying GET to 9200. To actually start loading data needs a patch thats still in gerrit (but is mostly ready, david will do more testing and maybe merge tomorrow) [21:13:24] !log twentyafterfour@tin Finished scap: Test 1.26wmf24 (duration: 50m 34s) [21:13:27] mafk: what can I do for you? 
[21:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:15:23] (03CR) 10EBernhardson: "looks legit to me, although for readability purposes i would probably just list all the servers instead of trying to regex it together." [puppet] - 10https://gerrit.wikimedia.org/r/240265 (owner: 10Rush) [21:16:56] (03CR) 10EBernhardson: [C: 031] elastic: update rack location for 1005 and 1030 [puppet] - 10https://gerrit.wikimedia.org/r/240265 (owner: 10Rush) [21:17:11] (03CR) 10Rush: [C: 032] elastic: update rack location for 1005 and 1030 [puppet] - 10https://gerrit.wikimedia.org/r/240265 (owner: 10Rush) [21:17:42] 6operations, 10ops-eqiad: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1664423 (10chasemp) [21:18:18] 10Ops-Access-Requests, 6operations: RESTBase Admin access on aqs1001, aqs1002, and aqs1003 for Joseph and Dan - https://phabricator.wikimedia.org/T113416#1664426 (10Milimetric) 3NEW [21:20:19] ebernhardson: think you have time to put up puppet patches? 
If not I can do some later tonight [21:21:21] (03PS1) 10Tim Landscheidt: labs_lvm: Require parted explicitly [puppet] - 10https://gerrit.wikimedia.org/r/240271 (https://phabricator.wikimedia.org/T112641) [21:22:36] PROBLEM - puppet last run on elastic1006 is CRITICAL: CRITICAL: puppet fail [21:24:24] (03PS2) 10MarcoAurelio: [Security] Restrict course page editing for any wiki with EducationProgram Extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240049 (https://phabricator.wikimedia.org/T112806) [21:24:31] (03CR) 10Ottomata: "Hey yall," [puppet] - 10https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: 10Milimetric) [21:24:37] yuvipanda: i can poke at it, some puppet to put together the nginx proxy couldn't be too hard [21:25:07] Yeah [21:25:28] (03PS4) 10Dzahn: (WIP) mailman: script to rename list [puppet] - 10https://gerrit.wikimedia.org/r/240024 [21:29:25] (03PS1) 1020after4: group0 wikis to 1.26wmf24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240272 [21:30:35] (03CR) 1020after4: [C: 032] group0 wikis to 1.26wmf24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240272 (owner: 1020after4) [21:31:31] (03Merged) 10jenkins-bot: group0 wikis to 1.26wmf24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240272 (owner: 1020after4) [21:33:04] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: group0 wikis to 1.26wmf24 [21:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:33:30] legoktm: sorry I didn't see your message before. 
The train is done, I think it all went well [21:33:45] woot [21:49:06] PROBLEM - puppet last run on mw2143 is CRITICAL: CRITICAL: puppet fail [21:49:13] !log unban elastic1030 from T112559 [21:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:49:25] 6operations, 10ops-eqiad: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1664511 (10chasemp) next window is planned for thursday noon EST [21:52:24] (03PS1) 10MarcoAurelio: Reverting abusefilter configuration for ee.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240278 [21:56:36] (03CR) 10CSteipp: [C: 031] [Security] Restrict course page editing for any wiki with EducationProgram Extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240049 (https://phabricator.wikimedia.org/T112806) (owner: 10MarcoAurelio) [22:00:04] RoanKattouw: Respected human, time to deploy Flow references migration (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150922T2200). Please do the needful. [22:05:46] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0] [22:17:06] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:17:19] 6operations, 10ops-ulsfo: Properly patch Telia @ ulsfo - https://phabricator.wikimedia.org/T112152#1664602 (10RobH) Carlos with UnitedLayer replied back and is ready to work on the link. However, I don't have any network admins to advise if the link should get disabled before work, or to simply pull. Since b... 
[22:18:05] RECOVERY - puppet last run on mw2143 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [22:31:17] (03CR) 10Catrope: [C: 032] Re-enable Flow on Flow_test_talk on beta (en and ca) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240079 (owner: 10Sbisson) [22:31:38] (03CR) 10Catrope: [C: 032] Enable Flow opt-in on testwiki for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240270 (owner: 10Catrope) [22:31:40] (03Merged) 10jenkins-bot: Re-enable Flow on Flow_test_talk on beta (en and ca) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240079 (owner: 10Sbisson) [22:31:47] (03CR) 10Catrope: [C: 032] Set $wgFlowMigrateReferenceWiki to false in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238111 (https://phabricator.wikimedia.org/T107204) (owner: 10Catrope) [22:31:59] (03Merged) 10jenkins-bot: Enable Flow opt-in on testwiki for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240270 (owner: 10Catrope) [22:32:17] (03Merged) 10jenkins-bot: Set $wgFlowMigrateReferenceWiki to false in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238111 (https://phabricator.wikimedia.org/T107204) (owner: 10Catrope) [22:36:55] (03PS2) 10Alex Monk: Reverting AbuseFilter configuration for ee.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240278 (owner: 10MarcoAurelio) [22:38:44] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable Flow opt-in on testwiki for testing (duration: 00m 12s) [22:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:40:30] (03CR) 10Alex Monk: [C: 031] Reverting AbuseFilter configuration for ee.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240278 (owner: 10MarcoAurelio) [22:43:59] !log catrope@tin Synchronized wmf-config/CommonSettings.php: Set $wgFlowMigrateReferenceWiki to false in production (duration: 00m 12s) [22:44:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:56:44] hi folks! Wiki Ed is going to set up SSL for dashboard.wikiedu.org. Is there anything I should know before wandering on to a random SSL provider and buying a cert (or getting a 'free for open source' one)? [23:03:30] Krenair: swat time? [23:03:37] I think I broke the [[Deployments]] page :| [23:06:48] probably [23:06:50] jouncebot, next [23:06:51] In 135 hour(s) and 53 minute(s): Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150928T1500) [23:06:54] I fixed it [23:07:02] RoanKattouw, you done? [23:08:00] legoktm, jenkins-bot doesn't seem happy with your changes [23:08:10] the most recent one? [23:08:35] jouncebot: refresh [23:08:36] I refreshed my knowledge about deployments. [23:08:37] https://gerrit.wikimedia.org/r/#/c/240281/ and https://gerrit.wikimedia.org/r/#/c/240283/ [23:08:40] jouncebot: next [23:08:40] In 15 hour(s) and 51 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150923T1500) [23:09:11] Krenair: right. because mw.storage fixes weren't backported to release branches [23:09:17] they're fine otherwise [23:10:44] Krenair: Yes, sorry [23:10:54] Go ahead and SWAT [23:10:57] And ignore the qunit failures [23:11:04] (03PS2) 10Alex Monk: Make commons and wikimania SUL logos transparent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239308 (https://phabricator.wikimedia.org/T72829) [23:11:10] (03CR) 10Alex Monk: [C: 032] Make commons and wikimania SUL logos transparent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239308 (https://phabricator.wikimedia.org/T72829) (owner: 10Alex Monk) [23:11:54] is jenkins not working legoktm? 
[23:12:15] zuul got stuck again [23:13:55] (03CR) 10Alex Monk: [V: 032] Make commons and wikimania SUL logos transparent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239308 (https://phabricator.wikimedia.org/T72829) (owner: 10Alex Monk) [23:14:36] !log krenair@tin Synchronized w/static/images/sul/commons.png: https://gerrit.wikimedia.org/r/#/c/239308/ (duration: 00m 12s) [23:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:14:59] !log krenair@tin Synchronized w/static/images/sul/wikimania.png: https://gerrit.wikimedia.org/r/#/c/239308/ (duration: 00m 11s) [23:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:15:37] (03PS2) 10Alex Monk: Copy default $wgEchoDefaultNotificationTypes['emailuser'] into wmf-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240259 (https://phabricator.wikimedia.org/T113367) [23:15:43] (03CR) 10Alex Monk: [C: 032] Copy default $wgEchoDefaultNotificationTypes['emailuser'] into wmf-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240259 (https://phabricator.wikimedia.org/T113367) (owner: 10Alex Monk) [23:15:58] (03CR) 10Alex Monk: [V: 032] Copy default $wgEchoDefaultNotificationTypes['emailuser'] into wmf-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240259 (https://phabricator.wikimedia.org/T113367) (owner: 10Alex Monk) [23:16:34] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/240259/ (duration: 00m 12s) [23:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:17:10] (03PS3) 10Alex Monk: Reverting AbuseFilter configuration for ee.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240278 (owner: 10MarcoAurelio) [23:17:34] (03CR) 10Alex Monk: [C: 032 V: 032] Reverting AbuseFilter configuration for ee.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240278 (owner: 10MarcoAurelio) 
[23:18:12] !log krenair@tin Synchronized wmf-config/abusefilter.php: https://gerrit.wikimedia.org/r/#/c/240278/ (duration: 00m 12s) [23:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:21:55] wtf... [23:22:37] ? [23:22:43] An unmerged commit on a deployment branch [23:22:44] commit 4f36f13d2af4cb1d7c7f129c094a88b6fad2a561 [23:22:45] Author: Casey Dentinger [23:22:45] Date: Tue Sep 22 14:52:31 2015 -0600 [23:22:45] Updated mediawiki/core [23:22:45] Project: mediawiki/extensions/DonationInterface d175a10faf993b0165745428c4962c206a164b54 [23:23:06] I thought DI was special? [23:23:53] (03PS1) 10Thcipriani: Add config deployment [tools/scap] - 10https://gerrit.wikimedia.org/r/240292 (https://phabricator.wikimedia.org/T109512) [23:26:12] legoktm, not special enough for me to revert it [23:26:59] for me not to be able to revert it* [23:27:13] hrm [23:27:49] (03CR) 10jenkins-bot: [V: 04-1] Add config deployment [tools/scap] - 10https://gerrit.wikimedia.org/r/240292 (https://phabricator.wikimedia.org/T109512) (owner: 10Thcipriani) [23:28:36] (03PS5) 10Dzahn: mailman: script to rename list [puppet] - 10https://gerrit.wikimedia.org/r/240024 [23:30:31] ugh, and then there's the other mess to deal with [23:30:34] Krenair: what's wrong with the last DonationInterface deployment merge? [23:31:37] ejegg: (guessing) the recurring issue is that the DonationInterface code gets merged, and core gets updated with new submodules, but the code isn't deployed [23:31:51] ejegg: so the next person to go and deploy sees this code that will go out when they deploy, but that they know nothing about [23:32:15] it shouldn't be a submodule in the wmf/ branches, right? [23:32:40] I don't think we need to revert anything in the DonationInterface repo [23:33:30] ejegg: i imagine it doesn't need revert, correct. just needs to be deployed (even if its a no-op in prod). 
i'm not familiar enough with how you all deploy in the fundraising cluster to have ideas on the total solution though [23:34:03] i would maybe start a mailing list thread on the releng list to figure out the best way, since this is an intermittent issue over several months [23:34:09] I'm confused - did somebody add the extension to a branch that shouldn't have it? [23:34:24] As far as I know, it only exists under the fundraising/ branches of core [23:34:38] ejegg: when code gets merged to a wmf/* branch, some bot auto updates the relevant core submodules and merges [23:34:39] No, I believe the wmf release config is fine. We do deploy DonationInterface to donatewiki [23:34:57] it's used there as a source of message strings only, and it's deployed from master. [23:35:01] (PS6) Dzahn: mailman: script to rename list [puppet] - https://gerrit.wikimedia.org/r/240024 [23:35:14] I'm thinking we should change our DI branch name so people know not to touch it [23:35:22] fundraising/deployment would be fine [23:35:29] (CR) Dzahn: [C: 2] mailman: script to rename list [puppet] - https://gerrit.wikimedia.org/r/240024 (owner: Dzahn) [23:35:33] that would probably work i think [23:35:43] !log krenair@tin Synchronized php-1.26wmf24/extensions/Echo: https://gerrit.wikimedia.org/r/#/c/240283/ and https://gerrit.wikimedia.org/r/#/c/240281/ (duration: 00m 13s) [23:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:36:15] ebernhardson: Krenair: Very cool that people are poking their heads into our mess, though! [23:36:33] Krenair: sorry, what was "and no." in reference to? 
[23:36:46] awight: its hard not to notice when doing a `git status` before deploying anything :) [23:37:02] well, git log HEAD...origin/wmf1.26wmf24 often shows the problem too :) [23:37:06] ebernhardson: wait, this shouldn't be getting through to anything on tin [23:37:20] awight, ejegg: If DonationInterface does not run on any non-donatewiki wiki, and donatewiki doesn't run on the main cluster, then maybe DonationInterface should be excluded from wmf/* branches altogether [23:37:25] awight: i havn't followed too closely, but that was Krenair's original issue [23:37:28] That's easy to do with a change in make-wmf-branch [23:37:43] (PS2) Dzahn: mailman: redirects for search lists -> discovery [puppet] - https://gerrit.wikimedia.org/r/238650 (https://phabricator.wikimedia.org/T110256) [23:37:52] I think we need the i18n. [23:37:56] But the log suggests Casey was messing with the 1.26wmf23 or 1.26wmf24 copy of DI [23:38:02] Urgh, right [23:38:06] awight, let me make this clear [23:38:11] (CR) Dzahn: [C: 2] mailman: redirects for search lists -> discovery [puppet] - https://gerrit.wikimedia.org/r/238650 (https://phabricator.wikimedia.org/T110256) (owner: Dzahn) [23:38:19] But... why [23:38:20] Krenair: omg I see the issue [23:38:28] yeah make-wmf-branch/config.json is wrong [23:38:29] (PS3) Dzahn: mailman: exim alias for discovery list renames [puppet] - https://gerrit.wikimedia.org/r/238652 (https://phabricator.wikimedia.org/T110256) [23:38:35] woo a fix! [23:38:38] Thanks for bringing it to our attention. [23:38:41] I am not going to be deploying other people's changes like this on tin just because you left it unmerged. [23:38:46] Krenair: Are you done SWATting BTW? 
[23:38:57] I only see fundraising/REL1_25 https://gerrit.wikimedia.org/r/#/q/owner:%22Cdentinger+%253Ccdentinger%2540wikimedia.org%253E%22+project:mediawiki/core,n,z [23:39:01] (PS4) Dzahn: mailman: exim alias for discovery list renames [puppet] - https://gerrit.wikimedia.org/r/238652 (https://phabricator.wikimedia.org/T110256) [23:39:17] (CR) Dzahn: [C: 2] mailman: exim alias for discovery list renames [puppet] - https://gerrit.wikimedia.org/r/238652 (https://phabricator.wikimedia.org/T110256) (owner: Dzahn) [23:39:22] If you merge something to a deployment branch in gerrit and then don't update tin for it, I will revert you. [23:39:25] hmm strange [23:39:33] https://gerrit.wikimedia.org/r/#/q/owner:%22Cdentinger+%253Ccdentinger%2540wikimedia.org%253E%22+project:mediawiki/extensions/DonationInterface+-branch:master,n,z also doesn't show anything [23:39:35] Krenair: but they aren't merging to any wmf* branches [23:39:35] As far as I know we deployed today just like we always do [23:39:45] Krenair: This is an accident--wmf main cluster deployment of that extension is supposed to come from master. [23:39:58] ebernhardson, gerrit does that part for them. [23:40:26] Krenair: looks like I already fixed this ;) https://gerrit.wikimedia.org/r/#/c/230705/ [23:40:35] !log renaming search mailing lists to discovery mailing lists [23:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:40:44] Krenair: Yes but why [23:41:05] I don't know and to be entirely honest, I don't care. 
[23:41:10] https://gerrit.wikimedia.org/r/#/c/240264/ [23:41:11] lol and it was the two of us who discovered the issue a month ago [23:41:15] It's the deployment branch, not the wmf* branch [23:41:28] Although maybe that is because of the make-wmf-branch config [23:41:46] Yeah let's please deescalate here: https://gerrit.wikimedia.org/r/#/c/230705/4 [23:41:59] RoanKattouw, by the way, yes, I dealt with the patches up for SWAT [23:42:08] I'm sorry that the config has been mess up for...ever, this must have been incredibly inconvenient for deployers all along. [23:42:32] OK [23:42:42] I'm gonna run a maintenance script on testwiki in ~5 mins then [23:43:07] awight: Everything around fundraising and deployment is very mysterious. Is this stuff documented anywhere? [23:43:21] lots of different places :) [23:43:27] wikitech? [23:43:28] !bash puppet isnt' so much of a language as it is an incantation phrase book powered by souls devoured thousands of years ago and given form by the heartache of opsen everywhere [23:43:32] it's scarily undocumented, yeah... here are our main entrypoints: http://mediawiki.org/wiki/Fundraising_tech [23:43:55] and http://wikitech.wikimedia.org/wiki/Fundraising and https://collab.wikimedia.org/wiki/Fundraising [23:44:12] collab is no good, this needs to be available to all deployers. [23:44:14] We're trying to move towards mediawiki.org for all new work [23:44:27] awight: Perhaps you and/or yours could write about this on the How_to_deploy_code page? [23:44:58] collab is supposed to be just for stuff that might help scammers that want to use us for stolen CC validation [23:45:04] Krenair: collab is for private information about stuff like fraud settings, we can't make these public. This is actually the root of our documentation rot, that it's difficult to get clarity on what can be public. [23:45:30] RoanKattouw: well, the situation we should be in is that there's zero impact of Fundraising work on mainstream deployment [23:46:39] ... 
there are documents about how to deploy to the fundraising cluster, but they are private & we don't want people randomly deploying without communicating with us directly. [23:46:58] ha... they can't. Minor issue of no frack permissions. [23:47:59] But still: Things that go out on the cluster (so, also the stuff on donate wiki) should be available somewhere people can see it. [23:48:00] awight, it's only documentation about how fundraising stuff affects proper deployments that needs to be available to deployers, whether it's public or not is irrelevant [23:48:20] awight's config.json patch looks like it'll do the trick - just needs rebase [23:49:16] Krenair: Good idea. I think there are some gotchas we should mention around CentralNotice. [23:53:47] Oh, right, CentralNotice has a special-snowflake deployment system [23:54:01] Quoting from my ops-l email from last week: [23:54:24] "* CentralNotice doesn't have any origin/wmf/* branches. Instead, it appears to follow the wmf_deploy branch. Why does CentralNotice need a special snowflake deployment branch setup? There are now a bunch of patches in CentralNotice deployed in wmf22 that aren't recorded in any branch, they're just sitting there. However, it seems like wmf23 runs a newer version of the wmf_deploy branch that... [23:54:26] ...does have these patches. IMO this is a bad strategy, because the actual deployed state of CentralNotice isn't recorded anywhere. If we had lost /srv/mediawiki-staging somehow, we would have been able to rebuild everything correctly from git (+any security patches), but not CentralNotice." [23:55:08] RoanKattouw: oh wow, thanks for making the connection to real consequences. [23:55:48] I was also auditing that directory for unrelated reasons and was confused by the non-standardness [23:56:13] What's your suggestion for developing on master but scheduling WMF cluster deployment? [23:56:34] Aren't wmf/* branches cut from master? 
[23:56:39] By default yes [23:56:47] But I think there are config options to do things differently [23:57:06] RECOVERY - HHVM rendering on mw2187 is OK: HTTP OK: HTTP/1.1 200 OK - 64890 bytes in 3.001 second response time [23:57:07] I don't remember if we still have the flag for "copy from previous branch", but that would be useful here [23:57:15] RECOVERY - Apache HTTP on mw2187 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.244 second response time [23:57:24] I would have expected that we would get wmf/* branches cut from wmf_deploy, which would satisfy our needs [23:57:35] Oh that too [23:57:39] Yeah [23:57:50] And then manage them explicitly when you need new stuff to be added there [23:57:50] Maybe this is just a release script tweak? [23:57:53] Probably [23:57:57] It might even already be supported [23:58:03] Or lmk if there's something we can configure better [23:58:06] in which case you just have to add the right magic to the json file [23:58:11] hmm [23:58:19] k I'll make a note to fix this [23:58:22] ISTR there's some sort of support for this but this was all years ago so I've forgotten