[00:43:35] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp4019_v6
[00:45:16] RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK
[01:14:19] (PS2) Gergő Tisza: Relocate the virtualenv to /srv/sentry [software/sentry] - https://gerrit.wikimedia.org/r/241259
[01:15:08] (CR) Gergő Tisza: "Fixed some stray symlinks. (Not sure what needs to be set up to make this repo merge on +2...)" [software/sentry] - https://gerrit.wikimedia.org/r/241259 (owner: Gergő Tisza)
[01:17:25] PROBLEM - puppet last run on restbase-test2002 is CRITICAL: CRITICAL: puppet fail
[01:27:05] (PS1) Gergő Tisza: Move Sentry packages into subclass [puppet] - https://gerrit.wikimedia.org/r/241575
[01:35:14] (PS1) Thcipriani: Fix mkdir_p relative paths [tools/scap] - https://gerrit.wikimedia.org/r/241576
[01:44:26] RECOVERY - puppet last run on restbase-test2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:05:32] (PS1) Gergő Tisza: Make Sentry config modular [puppet] - https://gerrit.wikimedia.org/r/241577
[02:05:34] (PS1) Gergő Tisza: [WIP] Configure LDAP plugin [puppet] - https://gerrit.wikimedia.org/r/241578
[02:08:00] (PS1) Base: Replaced hyphen with space in wgSitename for uawikimedia. [mediawiki-config] - https://gerrit.wikimedia.org/r/241579
[02:18:24] (PS1) Tim Landscheidt: Tools: Remove dependency of toollabs::checker on toollabs::submit [puppet] - https://gerrit.wikimedia.org/r/241581 (https://phabricator.wikimedia.org/T113744)
[02:20:22] !log l10nupdate@tin Synchronized php-1.26wmf24/cache/l10n: l10nupdate for 1.26wmf24 (duration: 06m 54s)
[02:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:22:17] (CR) Tim Landscheidt: [C: -1] "I missed the grid checks that need jobutils." [puppet] - https://gerrit.wikimedia.org/r/241581 (https://phabricator.wikimedia.org/T113744) (owner: Tim Landscheidt)
[02:27:17] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf24) at 2015-09-28 02:27:16+00:00
[02:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:28:03] (CR) Tim Landscheidt: "I missed that jobutils was already installed and the 503 was followed by a "NOT OK", i. e. the test itself worked, but failed. Several ot" [puppet] - https://gerrit.wikimedia.org/r/241581 (https://phabricator.wikimedia.org/T113744) (owner: Tim Landscheidt)
[02:46:01] (CR) Ori.livneh: "Looks fine, but what does it fix?" [puppet] - https://gerrit.wikimedia.org/r/241575 (owner: Gergő Tisza)
[02:47:22] (CR) Ori.livneh: Make Sentry config modular (2 comments) [puppet] - https://gerrit.wikimedia.org/r/241577 (owner: Gergő Tisza)
[02:47:40] (CR) Ori.livneh: [C: 2 V: 2] Relocate the virtualenv to /srv/sentry [software/sentry] - https://gerrit.wikimedia.org/r/241259 (owner: Gergő Tisza)
[02:52:53] (PS2) Ori.livneh: Move Sentry packages into subclass [puppet] - https://gerrit.wikimedia.org/r/241575 (owner: Gergő Tisza)
[03:02:57] (PS1) Tim Landscheidt: Tools: Unpuppetize host_aliases [puppet] - https://gerrit.wikimedia.org/r/241582 (https://phabricator.wikimedia.org/T109485)
[03:04:27] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied
[03:17:04] (CR) Gergő Tisza: "The non-existent require in line 75. (I removed it anyway because it is unnecessary if the class is globally required. If there is some wa" [puppet] - https://gerrit.wikimedia.org/r/241575 (owner: Gergő Tisza)
[03:25:40] (PS2) Gergő Tisza: Make Sentry config modular [puppet] - https://gerrit.wikimedia.org/r/241577
[03:25:42] (PS2) Gergő Tisza: [WIP] Configure LDAP plugin [puppet] - https://gerrit.wikimedia.org/r/241578
[03:26:15] (PS3) Ori.livneh: Move Sentry packages into subclass [puppet] - https://gerrit.wikimedia.org/r/241575 (owner: Gergő Tisza)
[03:26:48] (PS4) Ori.livneh: Move Sentry packages into subclass [puppet] - https://gerrit.wikimedia.org/r/241575 (owner: Gergő Tisza)
[03:26:56] (CR) Ori.livneh: [C: 2 V: 2] Move Sentry packages into subclass [puppet] - https://gerrit.wikimedia.org/r/241575 (owner: Gergő Tisza)
[03:27:36] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[04:02:47] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 11.54% of data above the critical threshold [100000000.0]
[04:05:16] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-maps/snapshot is not accessible: Permission denied
[04:05:25] (PS1) Gergő Tisza: sentry-auth fixes [puppet] - https://gerrit.wikimedia.org/r/241585
[04:21:34] (CR) 20after4: "Why not just use absolute symlinks?" [tools/scap] - https://gerrit.wikimedia.org/r/241576 (owner: Thcipriani)
[04:26:42] (CR) Ori.livneh: [C: 2] sentry-auth fixes [puppet] - https://gerrit.wikimedia.org/r/241585 (owner: Gergő Tisza)
[04:30:36] (PS3) Ori.livneh: Make Sentry config modular [puppet] - https://gerrit.wikimedia.org/r/241577 (owner: Gergő Tisza)
[04:31:07] (CR) Ori.livneh: [C: 2] Make Sentry config modular [puppet] - https://gerrit.wikimedia.org/r/241577 (owner: Gergő Tisza)
[04:33:16] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[04:36:23] (CR) Ori.livneh: "It hadn't occurred to me before, but another way this could be done is by performing the LDAP authentication at the Apache layer, as we co" [puppet] - https://gerrit.wikimedia.org/r/241578 (owner: Gergő Tisza)
[04:48:07] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[04:56:16] (PS3) Tim Starling: Update personal .bashrc [puppet] - https://gerrit.wikimedia.org/r/238363
[04:57:33] TimStarling: we "compile" wikiversions.json to wikiversions.php now. Since the PHP file is just as readable, we should probably get rid of the JSON file altogether.
[05:00:45] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
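The wikiversions mechanism described above (wikiversions.json, a plain dbname-to-version mapping that is compiled into a wikiversions.php returning the same array) can be sketched as follows. This is a minimal illustration in Python, not the deployment code; the sample mapping values are hypothetical, modelled on the 1.26wmf24 version seen in the log:

```python
import json

# Hypothetical sample of the wikiversions.json dbname -> version mapping;
# the real file lives in the mediawiki-config repository.
WIKIVERSIONS_JSON = '{"enwiki": "php-1.26wmf24", "nlwiki": "php-1.26wmf24"}'

def wiki_version(dbname: str) -> str:
    """Look up which MediaWiki version is deployed for a given wiki database name."""
    versions = json.loads(WIKIVERSIONS_JSON)
    return versions[dbname]

print(wiki_version("enwiki"))
```

On a deployment host the same lookup reads the compiled PHP file instead, as suggested in the chat below: `php -r '$v=include("/srv/mediawiki/wikiversions.php");echo $v["enwiki"];'`.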
[05:00:46] so you could do something like this instead: php -r '$v=include("/srv/mediawiki/wikiversions.php");echo $v["enwiki"];'
[05:02:15] (PS3) Gergő Tisza: [WIP] Configure LDAP plugin [puppet] - https://gerrit.wikimedia.org/r/241578
[05:06:26] (CR) 20after4: [C: 2] Filter target host logging from stdout of main process [tools/scap] - https://gerrit.wikimedia.org/r/241256 (https://phabricator.wikimedia.org/T113779) (owner: Dduvall)
[05:06:47] (Merged) jenkins-bot: Filter target host logging from stdout of main process [tools/scap] - https://gerrit.wikimedia.org/r/241256 (https://phabricator.wikimedia.org/T113779) (owner: Dduvall)
[05:07:13] (CR) 20after4: [C: 1] switch to git-based portal [puppet] - https://gerrit.wikimedia.org/r/240888 (https://phabricator.wikimedia.org/T110070) (owner: Smalyshev)
[05:52:34] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Sep 28 05:52:34 UTC 2015 (duration 52m 33s)
[05:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:08:45] (PS1) Ori.livneh: Rely on timeouts specified in php.ini rather than calling ini_set() [mediawiki-config] - https://gerrit.wikimedia.org/r/241588
[06:10:17] (PS2) Ori.livneh: Rely on timeouts specified in php.ini rather than calling ini_set() [mediawiki-config] - https://gerrit.wikimedia.org/r/241588
[06:10:31] (PS3) Ori.livneh: Rely on timeouts specified in php.ini rather than calling ini_set() [mediawiki-config] - https://gerrit.wikimedia.org/r/241588
[06:10:46] (CR) Ori.livneh: [C: 2] Rely on timeouts specified in php.ini rather than calling ini_set() [mediawiki-config] - https://gerrit.wikimedia.org/r/241588 (owner: Ori.livneh)
[06:10:52] (Merged) jenkins-bot: Rely on timeouts specified in php.ini rather than calling ini_set() [mediawiki-config] - https://gerrit.wikimedia.org/r/241588 (owner: Ori.livneh)
[06:11:39] !log ori@tin Synchronized wmf-config/CommonSettings.php: I179be4bd3: Rely on timeouts specified in php.ini rather than calling ini_set() (duration: 00m 17s)
[06:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:30:36] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail
[06:30:57] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: puppet fail
[06:31:16] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:26] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:36] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:06] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:26] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:46] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:27] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:36] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:56] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:15] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:56:26] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:56:45] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:56:47] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:56:47] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:07] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:57:15] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:57:26] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:57:26] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:36] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:46] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:06] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:35] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:47:05] operations, MediaWiki-extensions-TimedMediaHandler: Frequent job timeouts on HHVM video scalers - https://phabricator.wikimedia.org/T113284#1680014 (Joe) @Paladox @brion I think Ori might have found the problem and fixed it: we were setting an override on max_execution_time in mediawiki-config if not runn...
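The Phabricator comment above describes a configuration-precedence bug: a blanket runtime override (via PHP's `ini_set()`) takes precedence over the longer `max_execution_time` configured in php.ini for the video scalers, so long transcode jobs were killed early. A minimal sketch of that failure mode, in Python with hypothetical timeout values (the real limits are not stated in the log):

```python
# Hypothetical illustration: php.ini on a video scaler configures a long
# execution limit, but an unconditional runtime override clobbers it.
INI_SETTINGS = {"max_execution_time": 600}  # hypothetical php.ini value

def effective_timeout(runtime_override=None):
    """Runtime overrides (like ini_set) win over the php.ini value."""
    if runtime_override is not None:
        return runtime_override
    return INI_SETTINGS["max_execution_time"]

# Old mediawiki-config behaviour: always calling the override.
timeout_before_fix = effective_timeout(runtime_override=180)  # long jobs time out
# After Ori's change: rely on php.ini, no override.
timeout_after_fix = effective_timeout()
```

This is why the fix (change 241588 above) simply removed the `ini_set()` call and let each host class keep the limit set in its own php.ini.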
[07:47:46] operations, MediaWiki-extensions-TimedMediaHandler: Frequent job timeouts on HHVM video scalers - https://phabricator.wikimedia.org/T113284#1680015 (Joe) a: Joe
[08:06:16] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above 100
[08:15:09] (PS1) Hoo man: Add "nod" (Northern Thai) to wmgExtraLanguageNames for Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/241598
[08:17:25] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 100
[08:31:55] (Abandoned) Matthias Mullie: Whitelist Flow opt-in on user talkpage as BetaFeature [mediawiki-config] - https://gerrit.wikimedia.org/r/235003 (https://phabricator.wikimedia.org/T98270) (owner: Matthias Mullie)
[08:35:33] Error: 503, Service Unavailable at Mon, 28 Sep 2015 08:35:12 GMT
[08:35:34] Request: GET http://nl.wikipedia.org/wiki/Speciaal:Volglijst, from 10.20.0.103 via cp1066 cp1066 ([10.64.0.103]:3128), Varnish XID 868513435 Forwarded for: 82.72.122.85, 10.20.0.113, 10.20.0.113, 10.20.0.103 Error: 503, Service Unavailable at Mon, 28 Sep 2015 08:35:21 GMT
[08:35:37] bad gateway etc.
[08:36:29] mediawiki.org unresponsive giving 503s.
[08:36:32] same here
[08:36:35] PROBLEM - Apache HTTP on mw1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:36] PROBLEM - HHVM rendering on mw1238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:36] PROBLEM - Apache HTTP on mw1068 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:37] PROBLEM - HHVM rendering on mw1210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:37] PROBLEM - HHVM rendering on mw1183 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:37] PROBLEM - Apache HTTP on mw1066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:37] PROBLEM - Apache HTTP on mw1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:37] all wikis seem down
[08:36:37] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:37] PROBLEM - HHVM rendering on mw1245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:38] PROBLEM - Apache HTTP on mw1047 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:43] _joe_?
[08:36:45] PROBLEM - Apache HTTP on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:45] PROBLEM - Apache HTTP on mw1065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:45] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:45] PROBLEM - Apache HTTP on mw1108 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:45] PROBLEM - Apache HTTP on mw1104 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:45] PROBLEM - Apache HTTP on mw1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:46] PROBLEM - HHVM rendering on mw1220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:46] Yikes
[08:36:46] PROBLEM - Apache HTTP on mw1069 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:46] PROBLEM - Apache HTTP on mw1173 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:47] PROBLEM - Apache HTTP on mw1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:47] PROBLEM - Apache HTTP on mw1048 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:48] Deskana^^
[08:36:48] uh oh
[08:36:48] PROBLEM - Apache HTTP on mw1178 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:58] hoo: ^
[08:36:59] PROBLEM - Apache HTTP on mw1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:59] PROBLEM - Apache HTTP on mw1247 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:37:00] PROBLEM - Apache HTTP on mw1249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:37:00] PROBLEM - Apache HTTP on mw1099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:37:01] PROBLEM - HHVM rendering on mw1239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:37:01] PROBLEM - HHVM rendering on mw1212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:37:02] <_joe_> oh oh
[08:37:04] <_joe_> looking
[08:37:05] PROBLEM - Apache HTTP on mw1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:37:05] PROBLEM - HHVM rendering on mw1064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:37:05] PROBLEM - HHVM rendering on mw1112 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:37:05] PROBLEM - HHVM rendering on mw1051 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:37:05] PROBLEM - HHVM rendering on mw1066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:37:05] PROBLEM - HHVM rendering on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:37:11] maybe related to ori@tin Synchronized wmf-config/CommonSettings.php: I179be4bd3: Rely on timeouts specified in php.ini rather than calling ini_set() (duration: 00m 17s) ???
[08:37:18] -> https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:31] <_joe_> no
[08:38:14] meh
[08:38:21] <_joe_> we're being DDOSed
[08:38:51] i like the error page though. hadn't been able to see that version yet.
[08:39:19] <_joe_> thedj: heh, thanks for seeing the bright side of it :P
[08:39:23] PROBLEM - LVS HTTPS IPv4 on text-lb.codfw.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 8969 bytes in 0.242 second response time
[08:39:25] PROBLEM - HHVM rendering on mw1242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:39:25] PROBLEM - Mediawiki Apple Dictionary Bridge on terbium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:39:25] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:39:25] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:39:26] thedj: me too :)
[08:39:36] RECOVERY - Apache HTTP on mw1183 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.102 second response time
[08:39:36] RECOVERY - Apache HTTP on mw1220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.327 second response time
[08:39:36] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.269 second response time
[08:39:37] RECOVERY - HHVM rendering on mw1188 is OK: HTTP OK: HTTP/1.1 200 OK - 70334 bytes in 7.277 second response time
[08:39:37] RECOVERY - Apache HTTP on mw1210 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.376 second response time
[08:39:37] RECOVERY - Apache HTTP on mw1174 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.944 second response time
[08:39:45] RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.062 second response time
[08:39:45] RECOVERY - Apache HTTP on mw1170 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.691 second response time
[08:39:45] PROBLEM - LVS HTTPS IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 8841 bytes in 0.021 second response time
[08:39:49] RECOVERY - Apache HTTP on mw1215 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.718 second response time
[08:39:50] RECOVERY - HHVM rendering on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 70334 bytes in 0.603 second response time
[08:39:50] RECOVERY - HHVM rendering on mw1163 is OK: HTTP OK: HTTP/1.1 200 OK - 70334 bytes in 0.275 second response time
[08:39:50] PROBLEM - LVS HTTPS IPv6 on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 8954 bytes in 0.229 second response time
[08:39:54] RECOVERY - Apache HTTP on mw1188 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.197 second response time
[08:39:54] RECOVERY - HHVM rendering on mw1218 is OK: HTTP OK: HTTP/1.1 200 OK - 70335 bytes in 3.189 second response time
[08:39:55] RECOVERY - HHVM rendering on mw1174 is OK: HTTP OK: HTTP/1.1 200 OK - 70334 bytes in 0.559 second response time
[08:39:55] RECOVERY - HHVM rendering on mw1215 is OK: HTTP OK: HTTP/1.1 200 OK - 70334 bytes in 0.664 second response time
[08:39:55] RECOVERY - HHVM rendering on mw1181 is OK: HTTP OK: HTTP/1.1 200 OK - 70334 bytes in 0.973 second response time
[08:39:55] RECOVERY - HHVM rendering on mw1213 is OK: HTTP OK: HTTP/1.1 200 OK - 70335 bytes in 1.481 second response time
[08:39:55] RECOVERY - HHVM rendering on mw1187 is OK: HTTP OK: HTTP/1.1 200 OK - 70335 bytes in 2.702 second response time
[08:39:56] RECOVERY - HHVM rendering on mw1171 is OK: HTTP OK: HTTP/1.1 200 OK - 70335 bytes in 2.939 second response time
[08:39:56] PROBLEM - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 8896 bytes in 7.939 second response time
[08:40:05] RECOVERY - Apache HTTP on mw1212 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.071 second response time
[08:40:05] RECOVERY - Apache HTTP on mw1209 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.153 second response time
[08:40:06] RECOVERY - HHVM rendering on mw1219 is OK: HTTP OK: HTTP/1.1 200 OK - 70334 bytes in 0.268 second response time
[08:40:06] RECOVERY - Apache HTTP on mw1032 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.117 second response time
[08:40:15] PROBLEM - LVS HTTPS IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 8884 bytes in 0.417 second response time
[08:40:19] PROBLEM - LVS HTTPS IPv4 on mobile-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 8900 bytes in 0.490 second response time
[08:40:23] RECOVERY - HHVM rendering on mw1098 is OK: HTTP OK: HTTP/1.1 200 OK - 70334 bytes in 2.962 second response time
[08:40:23] RECOVERY - Apache HTTP on mw1166 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.062 second response time
[08:40:23] RECOVERY - Apache HTTP on mw1187 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.102 second response time
[08:40:23] RECOVERY - Apache HTTP on mw1167 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.162 second response time
[08:40:23] RECOVERY - HHVM rendering on mw1038 is OK: HTTP OK: HTTP/1.1 200 OK - 70334 bytes in 0.262 second response time
[08:40:24] RECOVERY - HHVM rendering on mw1025 is OK: HTTP OK: HTTP/1.1 200 OK - 70334 bytes in 0.617 second response time
[08:40:24] RECOVERY - HHVM rendering on mw1105 is OK: HTTP OK: HTTP/1.1 200 OK - 70341 bytes in 3.892 second response time
[08:40:25] RECOVERY - HHVM rendering on mw1065 is OK: HTTP OK: HTTP/1.1 200 OK - 70334 bytes in 0.771 second response time
[08:40:25] RECOVERY - HHVM rendering on mw1058 is OK: HTTP OK: HTTP/1.1 200 OK - 70342 bytes in 4.105 second response time
[08:40:26] RECOVERY - HHVM rendering on mw1021 is OK: HTTP OK: HTTP/1.1 200 OK - 70334 bytes in 1.028 second response time
[08:40:26] RECOVERY - HHVM rendering on mw1067 is OK: HTTP OK: HTTP/1.1 200 OK - 70334 bytes in 1.130 second response time
[08:40:27] RECOVERY - Apache HTTP on mw1063 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.012 second response time
[08:40:39] RECOVERY - HHVM rendering on mw1056 is OK: HTTP OK: HTTP/1.1 200 OK - 70343 bytes in 5.244 second response time
[08:40:39] RECOVERY - HHVM rendering on mw1101 is OK: HTTP OK: HTTP/1.1 200 OK - 70334 bytes in 5.523 second response time
[08:40:39] RECOVERY - Apache HTTP on mw1103 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.963 second response time
[08:40:40] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 8967 bytes in 3.520 second response time
[08:40:43] RECOVERY - HHVM rendering on mw1084 is OK: HTTP OK: HTTP/1.1 200 OK - 70334 bytes in 5.772 second response time
[08:40:43] RECOVERY - HHVM rendering on mw1062 is OK: HTTP OK: HTTP/1.1 200 OK - 70334 bytes in 5.946 second response time
[08:40:43] RECOVERY - HHVM rendering on mw1054 is OK: HTTP OK: HTTP/1.1 200 OK - 70342 bytes in 4.559 second response time
[08:40:43] RECOVERY - HHVM rendering on mw1100 is OK: HTTP OK: HTTP/1.1 200 OK - 70334 bytes in 6.089 second response time
[08:40:43] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.102 second response time
[08:40:44] RECOVERY - HHVM rendering on mw1039 is OK: HTTP OK: HTTP/1.1 200 OK - 70341 bytes in 0.146 second response time
[08:40:44] RECOVERY - HHVM rendering on mw1150 is OK: HTTP OK: HTTP/1.1 200 OK - 70334 bytes in 0.268 second response time
[08:40:45] RECOVERY - Apache HTTP on mw1042 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.595 second response time
[08:40:45] RECOVERY - HHVM rendering on mw1043 is OK: HTTP OK: HTTP/1.1 200 OK - 70334 bytes in 0.301 second response time
[08:41:00] nice
[08:41:36] <_joe_> still DDOSing btw
[08:41:42] <_joe_> just a less effective one
[08:42:04] RECOVERY - Apache HTTP on mw1181 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.850 second response time
[08:42:05] RECOVERY - HHVM rendering on mw1179 is OK: HTTP OK: HTTP/1.1 200 OK - 70322 bytes in 6.504 second response time
[08:42:05] RECOVERY - HHVM rendering on mw1170 is OK: HTTP OK: HTTP/1.1 200 OK - 70321 bytes in 6.553 second response time
[08:42:07] RECOVERY - HHVM rendering on mw1209 is OK: HTTP OK: HTTP/1.1 200 OK - 70322 bytes in 7.899 second response time
[08:42:07] RECOVERY - HHVM rendering on mw1166 is OK: HTTP OK: HTTP/1.1 200 OK - 70330 bytes in 7.986 second response time
[08:42:07] RECOVERY - Apache HTTP on mw1033 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.727 second response time
[08:42:07] RECOVERY - Apache HTTP on mw1038 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.914 second response time
[08:42:15] RECOVERY - LVS HTTPS IPv6 on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15110 bytes in 0.591 second response time
[08:42:23] RECOVERY - HHVM rendering on mw1073 is OK: HTTP OK: HTTP/1.1 200 OK - 70321 bytes in 6.243 second response time
[08:42:23] RECOVERY - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15068 bytes in 4.207 second response time
[08:42:27] RECOVERY - Apache HTTP on mw1064 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.777 second response time
[08:42:28] RECOVERY - LVS HTTPS IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 10531 bytes in 9.023 second response time
[08:42:32] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.744 second response time
[08:42:32] RECOVERY - Apache HTTP on mw1079 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.917 second response time
[08:42:32] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.391 second response time
[08:42:32] RECOVERY - Apache HTTP on mw1056 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.602 second response time
[08:42:33] RECOVERY - Apache HTTP on mw1046 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.984 second response time
[08:42:40] I see Fatal error: request has exceeded memory limit in hhvm.log, but not that many
[08:42:44] RECOVERY - LVS HTTPS IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10543 bytes in 0.481 second response time
[08:42:48] RECOVERY - LVS HTTPS IPv4 on mobile-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 10572 bytes in 0.564 second response time
[08:42:52] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15108 bytes in 0.666 second response time
[08:42:55] Request: GET http://en.wikisource.org/, from 10.20.0.110 via cp1054 cp1054 ([10.64.32.106]:3128), Varnish XID 1638017241
[08:42:56] RECOVERY - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15108 bytes in 0.646 second response time
[08:42:57] Forwarded for: 80.176.129.180, 10.20.0.113, 10.20.0.113, 10.20.0.110
[08:42:58] Error: 503, Service Unavailable at Mon, 28 Sep 2015 08:42:11 GMT
[08:43:00] RECOVERY - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15137 bytes in 0.715 second response time
[08:43:03] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15136 bytes in 0.605 second response time
[08:43:07] Down for maintenance in Europe?
[08:43:08] RECOVERY - HHVM rendering on mw1256 is OK: HTTP OK: HTTP/1.1 200 OK - 70321 bytes in 6.518 second response time
[08:43:09] RECOVERY - LVS HTTPS IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10543 bytes in 8.588 second response time
[08:43:12] ShakespeareFan00: ddos
[08:43:15] how many times this month?
[08:43:18] RECOVERY - HHVM rendering on mw1243 is OK: HTTP OK: HTTP/1.1 200 OK - 70320 bytes in 0.091 second response time [08:43:19] RECOVERY - HHVM rendering on mw1251 is OK: HTTP OK: HTTP/1.1 200 OK - 70320 bytes in 0.107 second response time [08:43:19] RECOVERY - Apache HTTP on mw1244 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.029 second response time [08:43:19] RECOVERY - HHVM rendering on mw1240 is OK: HTTP OK: HTTP/1.1 200 OK - 70320 bytes in 0.089 second response time [08:43:27] spagewmf: Almost surely unrelated [08:43:28] RECOVERY - Apache HTTP on mw1250 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.022 second response time [08:43:28] RECOVERY - HHVM rendering on mw1244 is OK: HTTP OK: HTTP/1.1 200 OK - 70320 bytes in 0.092 second response time [08:43:28] RECOVERY - HHVM rendering on mw1255 is OK: HTTP OK: HTTP/1.1 200 OK - 70320 bytes in 0.099 second response time [08:43:29] RECOVERY - Apache HTTP on mw1243 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time [08:43:30] RECOVERY - Apache HTTP on mw1252 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.033 second response time [08:43:30] sjoerddebruin: Who did Wikipedia upset this time? 
[08:43:30] RECOVERY - Apache HTTP on mw1239 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.026 second response time [08:43:30] RECOVERY - HHVM rendering on mw1258 is OK: HTTP OK: HTTP/1.1 200 OK - 70320 bytes in 0.111 second response time [08:43:33] The profiler stuff is known to be running OOM [08:43:34] <_joe_> ShakespeareFan00: nope, a ddos [08:43:38] RECOVERY - Apache HTTP on mw1236 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.035 second response time [08:43:38] RECOVERY - HHVM rendering on mw1236 is OK: HTTP OK: HTTP/1.1 200 OK - 70320 bytes in 0.093 second response time [08:43:39] RECOVERY - HHVM rendering on mw1248 is OK: HTTP OK: HTTP/1.1 200 OK - 70320 bytes in 0.092 second response time [08:43:39] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [500.0] [08:43:40] RECOVERY - HHVM rendering on mw1250 is OK: HTTP OK: HTTP/1.1 200 OK - 70320 bytes in 0.087 second response time [08:43:40] RECOVERY - Apache HTTP on mw1253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.026 second response time [08:43:40] RECOVERY - HHVM rendering on mw1257 is OK: HTTP OK: HTTP/1.1 200 OK - 70320 bytes in 0.105 second response time [08:43:40] RECOVERY - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15096 bytes in 0.137 second response time [08:43:48] RECOVERY - HHVM rendering on mw1238 is OK: HTTP OK: HTTP/1.1 200 OK - 70320 bytes in 0.111 second response time [08:43:58] RECOVERY - Apache HTTP on mw1047 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.040 second response time [08:43:58] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.041 second response time [08:43:58] RECOVERY - Apache HTTP on mw1060 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.049 second response time [08:43:58] RECOVERY - HHVM rendering on mw1245 is OK: HTTP OK: HTTP/1.1 200 OK - 
70320 bytes in 0.087 second response time [08:43:58] RECOVERY - Apache HTTP on mw1219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.031 second response time [08:43:59] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.043 second response time [08:43:59] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.046 second response time [08:43:59] <_joe_> gavel: too many [08:44:00] RECOVERY - Apache HTTP on mw1104 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.039 second response time [08:44:00] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.037 second response time [08:44:01] RECOVERY - Apache HTTP on mw1108 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.038 second response time [08:44:01] RECOVERY - Apache HTTP on mw1069 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.040 second response time [08:44:02] RECOVERY - Apache HTTP on mw1173 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.035 second response time [08:44:18] _joe_: WMF sue those idiots [08:44:18] RECOVERY - Apache HTTP on mw1062 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.037 second response time [08:44:18] RECOVERY - Apache HTTP on mw1098 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.038 second response time [08:44:18] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.040 second response time [08:44:18] RECOVERY - HHVM rendering on mw1217 is OK: HTTP OK: HTTP/1.1 200 OK - 70320 bytes in 0.107 second response time [08:44:18] RECOVERY - HHVM rendering on mw1172 is OK: HTTP OK: HTTP/1.1 200 OK - 70320 bytes in 0.107 second response time [08:44:19] RECOVERY - HHVM rendering on mw1184 is OK: HTTP OK: HTTP/1.1 200 OK - 70320 bytes in 0.116 second response time [08:44:19] RECOVERY - LVS HTTP IPv4 on 
appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 14662 bytes in 0.059 second response time [08:44:22] RECOVERY - Apache HTTP on mw1055 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.042 second response time [08:44:23] RECOVERY - Apache HTTP on mw1247 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.031 second response time [08:44:23] RECOVERY - LVS HTTPS IPv6 on mobile-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10542 bytes in 0.315 second response time [08:44:27] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.041 second response time [08:44:27] RECOVERY - Apache HTTP on mw1249 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.038 second response time [08:44:27] RECOVERY - HHVM rendering on mw1239 is OK: HTTP OK: HTTP/1.1 200 OK - 70320 bytes in 0.092 second response time [08:44:27] RECOVERY - HHVM rendering on mw1212 is OK: HTTP OK: HTTP/1.1 200 OK - 70320 bytes in 0.096 second response time [08:44:27] RECOVERY - Apache HTTP on mw1029 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.043 second response time [08:44:28] RECOVERY - HHVM rendering on mw1081 is OK: HTTP OK: HTTP/1.1 200 OK - 70321 bytes in 0.126 second response time [08:44:28] RECOVERY - HHVM rendering on mw1112 is OK: HTTP OK: HTTP/1.1 200 OK - 70321 bytes in 0.127 second response time [08:44:29] RECOVERY - HHVM rendering on mw1096 is OK: HTTP OK: HTTP/1.1 200 OK - 70321 bytes in 0.123 second response time [08:44:29] RECOVERY - HHVM rendering on mw1053 is OK: HTTP OK: HTTP/1.1 200 OK - 70321 bytes in 0.125 second response time [08:44:30] RECOVERY - HHVM rendering on mw1051 is OK: HTTP OK: HTTP/1.1 200 OK - 70321 bytes in 0.126 second response time [08:44:30] RECOVERY - HHVM rendering on mw1066 is OK: HTTP OK: HTTP/1.1 200 OK - 70321 bytes in 0.122 second response time [08:44:59] <_joe_> gavel: who? the poor people who got themselves infected with some botnet? 
It's not that easy [08:45:06] <_joe_> sadly [08:45:17] Speaking of which? [08:45:20] _joe_: it's never easy [08:45:31] Is it possible to check what IP ranges are overloading? [08:46:02] because it's an ISP issue as much as it is a user one [08:46:37] ShakespeareFan00: we've been looking into the locations yes [08:46:59] <_joe_> ShakespeareFan00: it's a botnet, this much I can tell [08:47:09] Surely if it's proven as a botnet the error message should say this? [08:47:18] (or is that not definitively proven?) [08:49:59] <_joe_> ShakespeareFan00: I'm not sure I follow [08:50:25] <_joe_> the error message you get is from the mediawiki application, simply not handling the load [08:50:30] Ah OK [08:50:32] Thanks [08:50:45] <_joe_> so the error message can't tell you that :) [08:51:25] I can't honestly think why someone would DDoS Wikimedia... [08:51:31] Unless it's for lulz :( [08:51:43] <_joe_> well, big lulz :P [08:52:20] Criminlulz XD (Sorry I know it's in poor taste) [08:52:28] :( [08:55:00] operations, we <3 you [08:55:30] +1 [08:56:04] ShakespeareFan00: yep, criminlulz, as per Computer Fraud and Abuse Act [08:56:14] but as _joe_ says, difficult to prosecute [09:00:43] which of these pictures most closely resembles Wikimedia's anti-botnet network operations center, https://www.google.com/search?q=network+operations+center&tbm=isch ? [09:01:52] looks like our 112 dispatch center heh [09:02:17] <_joe_> spagewmf: https://media3.giphy.com/media/D0EjguuQzYr9m/200_s.gif [09:03:59] spagewmf: I thought Wikimedia's data farm was in an ex bunker? [09:04:01] XD [09:04:15] So it looks like something out of an action thriller... [09:04:17] :) [09:04:28] :/ [09:15:29] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:17:15] 6operations: How to page when a host is down?
- https://phabricator.wikimedia.org/T113834#1680112 (10fgiunchedi) I agree with @dzahn, it seems what we need is higher level checks for labs networking functioning and page on those. In general I think we're better off having pages for high level service status or I... [09:19:23] <_joe_> for whoever's interested, http://ganglia.wikimedia.org/latest/stacked.php?m=HHVM.load&c=Application%20servers%20eqiad&r=hour&st=1443431754&host_regex= [09:19:37] <_joe_> is a good measure of how huge the traffic was :) [09:20:40] 6operations: How to page when a host is down? - https://phabricator.wikimedia.org/T113834#1680118 (10mark) Agreed, define a high level test which checks Labs networking and is not dependent on a single node being up. [09:23:49] RECOVERY - Disk space on ms-be1012 is OK: DISK OK [09:25:09] PROBLEM - RAID on pybal-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:25:18] PROBLEM - Disk space on pybal-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:25:19] PROBLEM - DPKG on pybal-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:25:49] PROBLEM - dhclient process on pybal-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:25:59] PROBLEM - salt-minion processes on pybal-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:26:00] PROBLEM - configured eth on pybal-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:26:00] PROBLEM - puppet last run on pybal-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
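[Editor's note: In the burst above, seven independent NRPE checks on pybal-test2003 all time out within about a minute, which usually points at the host or its NRPE agent rather than seven services failing at once — the same theme as the T113834 ticket ("How to page when a host is down?") discussed earlier. A minimal sketch of collapsing such a burst into a host-level suspect; the alert list is transcribed from the log above, and the 5-check threshold is purely illustrative, not Wikimedia's actual paging logic:]

```python
from collections import Counter

# (host, service) pairs whose checks fired around 09:25-09:26 above, plus
# one unrelated host for contrast. Illustrative data, not a live feed.
alerts = [
    ("pybal-test2003", "RAID"),
    ("pybal-test2003", "Disk space"),
    ("pybal-test2003", "DPKG"),
    ("pybal-test2003", "dhclient process"),
    ("pybal-test2003", "salt-minion processes"),
    ("pybal-test2003", "configured eth"),
    ("pybal-test2003", "puppet last run"),
    ("ms-be1012", "puppet last run"),
]

def host_level_suspects(alerts, min_services=5):
    """Collapse per-service timeouts into host-level suspects: if many
    unrelated checks on one host fail together, suspect the host itself."""
    by_host = Counter(host for host, _ in alerts)
    return sorted(h for h, n in by_host.items() if n >= min_services)

print(host_level_suspects(alerts))  # -> ['pybal-test2003']
```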
[09:29:58] RECOVERY - RAID on ms-be1012 is OK: OK: optimal, 13 logical, 13 physical [09:32:43] 6operations, 10ops-eqiad: ms-be1012: slot=5 dev=sdf failed - https://phabricator.wikimedia.org/T113929#1680137 (10fgiunchedi) 3NEW [09:34:24] ACKNOWLEDGEMENT - puppet last run on ms-be1012 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi T113929 [09:47:17] 6operations, 10Datasets-General-or-Unknown, 7HHVM: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#1680176 (10ArielGlenn) [09:56:58] (03PS1) 10ArielGlenn: dumps: fall back to php5 instead of hhvm for now [puppet] - 10https://gerrit.wikimedia.org/r/241612 [09:57:36] (03CR) 10ArielGlenn: [C: 04-1] "do not merge right now" [puppet] - 10https://gerrit.wikimedia.org/r/241612 (owner: 10ArielGlenn) [09:58:46] <_joe_> apergos: do you need assistance with HHVM? [09:58:52] <_joe_> if so, please ping me [09:59:20] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer [09:59:29] _joe_: it's a build issue; I think you are already doing a new build yes? [09:59:36] ah see um [09:59:47] <_joe_> what's the bug you are experiencing? [10:00:12] 6operations, 10Datasets-General-or-Unknown, 7HHVM: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#1680209 (10ArielGlenn) snapshot1002 was reinstalled with trusty last week. The utf8 normal extension had already been rebuilt by ori for trusty, so I figured we were... [10:00:18] https://phabricator.wikimedia.org/T113932 [10:00:19] this [10:01:04] Jamesofur: wrt EP namespace protection, have you already tested it? [10:01:19] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [10:01:20] <_joe_> uhm yeah I'll take a look at how to enable it [10:01:32] so if that can't get into the new build in time for the october run, we can just fall back for that run [10:01:45] if it does then I don't merge it. 
all good [10:01:57] I realize oct is only 2 days from now [10:02:40] there will be a config setting I'll likely have to change too but can't tell that til we have the feature enabled [10:03:31] <_joe_> apergos: I'll try to look into it today [10:03:52] ok [10:04:15] I looked around in our deb repo and meh. couldn't find dick-all :-D [10:04:27] thing takes forever to git clone too [10:04:28] (03CR) 10MarcoAurelio: "Specially, do not merge before Ib1674524 is at least resolved." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236231 (https://phabricator.wikimedia.org/T110619) (owner: 10MarcoAurelio) [10:04:57] (03CR) 10MarcoAurelio: "Specially, do not merge before Ib1674524 is at least resolved." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236418 (https://phabricator.wikimedia.org/T111630) (owner: 10MarcoAurelio) [10:05:34] <_joe_> apergos: thing is *huge* [10:05:42] I noticed! [10:05:52] took 5 retries before I got it to complete [10:08:39] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer [10:09:15] (03PS11) 10Filippo Giunchedi: cassandra: WIP support for multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253) [10:10:29] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [10:18:46] 6operations, 10Datasets-General-or-Unknown, 5Patch-For-Review: Add App Guidelines on Dumps Page - https://phabricator.wikimedia.org/T110742#1680284 (10ArielGlenn) hello, @Krenair or @VBaranetsky, can I get a thumbs up/down on the old change, and then on the new change? 
[10:23:08] PROBLEM - NTP on pybal-test2003 is CRITICAL: NTP CRITICAL: No response from NTP server [10:36:56] (03PS1) 10ArielGlenn: stop dumping interwiki table (T103589) [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/241618 [10:40:05] (03CR) 10ArielGlenn: [V: 032] stop dumping interwiki table (T103589) [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/241618 (owner: 10ArielGlenn) [10:40:19] (03CR) 10ArielGlenn: [C: 032] stop dumping interwiki table (T103589) [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/241618 (owner: 10ArielGlenn) [10:42:11] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Frequent job timeouts on HHVM video scalers - https://phabricator.wikimedia.org/T113284#1680325 (10TheDJ) Might be better for some cases, but the Lila Tetrikov file from T113532 still appears to have troubles. Possibly that's another issue and the ticket... [10:43:54] <_joe_> apergos: quite interestingly, it seems an HHVM bug [10:44:05] <_joe_> so you should report it upstream, probably [10:44:10] 6operations, 10Dumps-Generation: sql dump schemata - seven tables should have their columns reordered - https://phabricator.wikimedia.org/T103583#1680329 (10ArielGlenn) [10:44:14] oh joy [10:44:22] well that's sadness [10:44:58] can I add you as a subscriber on the upstream bug report? [10:45:14] (they may ask how we build and I won't have any idea) [10:45:40] <_joe_> apergos: your bug happens with their vanilla package [10:46:07] oh nice [10:46:23] maybe I should test with their latest version? 
[10:46:45] <_joe_> I am [10:49:19] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer [10:51:09] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [10:52:15] * apergos bites fingernails [10:56:24] (03CR) 10Alex Monk: [C: 04-2] "That's a poor workaround, and is only going to happen because of the issues involved in completely chucking this extension out of producti" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236231 (https://phabricator.wikimedia.org/T110619) (owner: 10MarcoAurelio) [10:57:03] (03CR) 10Alex Monk: "Copying from the other commit:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236418 (https://phabricator.wikimedia.org/T111630) (owner: 10MarcoAurelio) [10:57:09] (03CR) 10Alex Monk: [C: 04-2] Enable Extension:EducationProgram on enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236418 (https://phabricator.wikimedia.org/T111630) (owner: 10MarcoAurelio) [11:02:56] 6operations, 10Datasets-General-or-Unknown, 7HHVM: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#1680365 (10Krenair) >>! In T94277#1679736, @Reedy wrote: >>>! In T94277#1679712, @JeroenDeDauw wrote: >> September itself is about to be gone... > > We've only just got... [11:04:17] 6operations, 10Datasets-General-or-Unknown, 5Patch-For-Review: Add App Guidelines on Dumps Page - https://phabricator.wikimedia.org/T110742#1680366 (10Krenair) This is waiting for @VBaranetsky to comment, not me... [11:09:25] 6operations, 10Dumps-Generation: sql dump schemata - seven tables should have their columns reordered - https://phabricator.wikimedia.org/T103583#1680378 (10jcrespo) @wpmirrordev Sorry, I do not fully understand "fields defined in the wrong position" (proabably it is just me). Do you mean that [[ https://git... [11:11:29] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer [11:11:30] _joe_: with their latest version, what's the result? 
[11:11:48] <_joe_> apergos: pasted in the ticket [11:11:54] ah thanks! [11:13:28] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [11:15:34] (03Abandoned) 10MarcoAurelio: Enable Extension:EducationProgram on enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236418 (https://phabricator.wikimedia.org/T111630) (owner: 10MarcoAurelio) [11:16:09] (03Abandoned) 10MarcoAurelio: Enable Education Program extension at srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236231 (https://phabricator.wikimedia.org/T110619) (owner: 10MarcoAurelio) [11:19:29] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 88 data above and 7 below the confidence bounds [11:20:36] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: assess impact of many cassandra seed nodes with multi instance - https://phabricator.wikimedia.org/T113939#1680393 (10fgiunchedi) 3NEW a:3fgiunchedi [11:21:05] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: assess impact of many cassandra seed nodes with multi instance - https://phabricator.wikimedia.org/T113939#1680402 (10fgiunchedi) also note that at the moment the firewall ACLs are based on seed nodes [11:24:20] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer [11:25:37] "Server answer" is an SSH error message? 
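[Editor's note: to the question just above — "Server answer" is what the monitoring-plugins check_ssh probe reports when the server accepts the TCP connection but the first line it sends back is not an SSH identification string ("SSH-2.0-..."), typical of a host that is up but too unhealthy to run sshd properly. A self-contained sketch of that behaviour; the fake server and the exact wording are illustrative, not the plugin's actual code:]

```python
import socket
import threading

def serve_once(srv, banner):
    # Accept one connection and reply with a (possibly bogus) banner.
    conn, _ = srv.accept()
    conn.sendall(banner)
    conn.close()

def check_ssh_banner(host, port, timeout=5.0):
    """Roughly what check_ssh does: connect, read the first line, and
    expect an SSH identification string ("SSH-2.0-..."). Anything else
    is reported back as the server's raw answer."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        line = s.makefile("rb").readline().strip().decode("ascii", "replace")
    if line.startswith("SSH-"):
        return "SSH OK", line
    return "Server answer", line

# Simulate a host that accepts connections but cannot speak SSH properly,
# the state pybal-test2003 appears to be flapping in above.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]
t = threading.Thread(target=serve_once, args=(srv, b"wrong protocol\r\n"))
t.start()
status, answer = check_ssh_banner("127.0.0.1", port)
t.join()
srv.close()
print(status)  # -> Server answer
```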
[11:26:09] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [11:28:56] 6operations: make dumps easy to rerun or clean up - https://phabricator.wikimedia.org/T110876#1680431 (10ArielGlenn) [11:30:48] 6operations: make dumps easy to rerun or clean up - https://phabricator.wikimedia.org/T110876#1588735 (10ArielGlenn) [11:33:12] 6operations: make dumps easy to rerun or clean up - https://phabricator.wikimedia.org/T110876#1680465 (10ArielGlenn) [11:35:02] (03PS1) 10ArielGlenn: dumps: move many classes out of worker.py into separate modules [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/241626 [11:36:29] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: puppet fail [11:49:10] (03PS5) 10BBlack: improve XFF/XFP/XRIP code in common VCL [puppet] - 10https://gerrit.wikimedia.org/r/240582 [11:50:30] (03CR) 10BBlack: [C: 032] improve XFF/XFP/XRIP code in common VCL [puppet] - 10https://gerrit.wikimedia.org/r/240582 (owner: 10BBlack) [11:55:07] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Frequent job timeouts on HHVM video scalers - https://phabricator.wikimedia.org/T113284#1680525 (10Paladox) It now works for https://commons.wikimedia.org/wiki/File:Wikimania_2014_-_Technology_VI_-_Views_-_FastCCI.webm [11:55:24] 6operations: make dumps easy to rerun or clean up - https://phabricator.wikimedia.org/T110876#1680527 (10ArielGlenn) related to restarts but not a blocker, still worth considering: T29125 [11:55:29] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Frequent job timeouts on HHVM video scalers - https://phabricator.wikimedia.org/T113284#1680529 (10Paladox) Thankyou @Joe and @Ori for fixing the problem. 
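[Editor's note: the "HTTP error ratio anomaly detection" alert earlier in the log reports how many datapoints fall above/below modeled confidence bounds ("88 data above and 7 below"). The graphite check's real model is more sophisticated; the following is only a crude stand-in using mean ± k·stdev of a historical window, to illustrate the shape of such a check. All numbers are made up:]

```python
import statistics

def points_outside_bounds(history, recent, k=3.0):
    """Count recent datapoints falling outside mean +/- k*stdev of the
    historical window -- a crude stand-in for a confidence-band check."""
    mu = statistics.mean(history)
    sigma = statistics.pstdev(history)
    above = sum(1 for x in recent if x > mu + k * sigma)
    below = sum(1 for x in recent if x < mu - k * sigma)
    return above, below

# Hypothetical per-minute 5xx rates: a calm history, then a spike and a dip.
history = [10, 12, 11, 9, 10, 11, 10, 12, 9, 11]
recent = [20, 10, 2]
print(points_outside_bounds(history, recent))  # -> (1, 1)
```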
[12:02:38] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [12:14:54] 6operations, 10Dumps-Generation: make dumps easy to rerun or clean up - https://phabricator.wikimedia.org/T110876#1680573 (10ArielGlenn) [12:16:28] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer [12:19:36] 6operations, 10Dumps-Generation, 7Tracking: staged dumps implementation - https://phabricator.wikimedia.org/T107757#1680600 (10ArielGlenn) [12:20:37] 6operations, 10Dumps-Generation: redo dumps monitor so it runs as a service - https://phabricator.wikimedia.org/T110888#1680604 (10ArielGlenn) [12:20:56] 6operations, 10Dumps-Generation: Make dumps run via cron on each snapshot host - https://phabricator.wikimedia.org/T107750#1680607 (10ArielGlenn) [12:21:59] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [12:27:30] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer [12:33:19] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [12:40:39] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer [12:40:51] (03PS2) 10ArielGlenn: for wmf reimage script, don't rotate saltmaster aes key on minion key deletion [puppet] - 10https://gerrit.wikimedia.org/r/238164 [12:41:32] (03PS1) 10BBlack: varnish: misspass limiter [puppet] - 10https://gerrit.wikimedia.org/r/241643 [12:42:29] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [12:47:24] is jenkins working for anyone else? 
[12:48:26] I think it's a double agent yes [12:48:46] cute [12:49:24] but the point is I rebased a change, a 2 line change, on a script that's not that long and it's still not given me a verify [12:50:07] it is stalled [12:50:12] too many patches to process at once [12:50:12] https://integration.wikimedia.org/zuul/status [12:50:31] I am trying to find out why it does not run some jobs :/ [12:51:39] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer [12:52:53] ok, thanks for the heads up [12:53:36] that's a hella long list [12:55:42] I think it is all because gate-and-submit takes precedence over all other pipelines [12:57:09] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [12:59:50] apergos: hey [13:00:01] apergos: we've been getting this: [13:00:14] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: assess impact of many cassandra seed nodes with multi instance - https://phabricator.wikimedia.org/T113939#1680654 (10mobrovac) One seed instance per hardware node should do it, as AFAIK the seed nodes are used to initialise Cassandra's DHT.... [13:00:15] # cat /var/log/exim4/paniclog [13:00:15] 2015-09-24 23:17:07 Failed to get user name for uid 995 [13:00:15] 2015-09-25 23:17:08 Failed to get user name for uid 995 [13:00:15] 2015-09-26 23:17:07 Failed to get user name for uid 995 [13:00:15] 2015-09-27 23:17:07 Failed to get user name for uid 995 [13:00:38] on fluorine [13:00:39] which host? 
and I'll look at it right after meeting (in one now) [13:00:41] ok I'll hunt [13:00:42] role::logging::mediawiki is applied there [13:00:43] thanks [13:00:49] and there is: [13:00:49] cron { 'rsync_slow_parse': [13:00:50] command => '/usr/bin/rsync -rt /a/mw-log/archive/slow-parse.log*.gz dumps.wikimedia.org::slow-parse/', [13:00:53] hour => 23, [13:00:55] minute => 15, [13:00:58] environment => 'MAILTO=ops-dumps@wikimedia.org', [13:00:58] user => 'datasets', [13:01:05] so this fits that timeframe quite accurately [13:01:16] and also I remember a discussion around the "datasets" uid [13:01:21] so it's possibly related? [13:16:59] 6operations, 10Datasets-General-or-Unknown, 7HHVM: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#1680723 (10mark) p:5Normal>3High [13:19:29] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer [13:29:40] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: assess impact of many cassandra seed nodes with multi instance - https://phabricator.wikimedia.org/T113939#1680778 (10Eevans) > when fully deployed we'll likely have ~3-4x the number of cassandra jvms running we have now, ATM every jvm is al... [13:29:53] paravoid: I will look at that but not actually in the next five minutes, I lied. I need to eat. then immediately after, I will look at that indeed! thanks [13:30:09] :) [13:30:11] bon appetit [13:30:28] thanks! 
bbr getting food [13:33:21] (03PS1) 10Faidon Liambotis: Set eqiad's admin_state to "down" [dns] - 10https://gerrit.wikimedia.org/r/241647 [13:33:56] bblack: ^ [13:34:00] mark: ^ too I guess :) [13:34:14] (coordinated/scheduled to be deployed in 30') [13:34:23] :) [13:34:28] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [13:35:13] (03PS1) 10Hoo man: Explicitly set wmgUseWikibaseQualityExternalValidation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241649 [13:36:51] (03CR) 10BBlack: [C: 031] Set eqiad's admin_state to "down" [dns] - 10https://gerrit.wikimedia.org/r/241647 (owner: 10Faidon Liambotis) [13:37:16] (03CR) 10Filippo Giunchedi: varnish: misspass limiter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/241643 (owner: 10BBlack) [13:40:47] 6operations, 10RESTBase, 10RESTBase-Cassandra: column family cassandra metrics size - https://phabricator.wikimedia.org/T113733#1680817 (10Eevans) > I think we can trim the list of derived metrics to the most relevant ones, e.g. 50/75/95/99 percentile, count, 1MinuteRate I still can't help but wish that we... 
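[Editor's note: the eqiad admin_state change scheduled above moves user traffic by changing DNS answers, so the cutover is not instant — resolvers keep the old record until their cached TTL expires. Under the simplifying assumption that cached-record ages are uniformly distributed, the shifted fraction ramps roughly linearly over one TTL; the 600 s TTL here is an assumption for illustration, consistent with the "10 minutes for the ramp over the DNS TTLs" observation later in this log:]

```python
def shifted_fraction(elapsed_s, ttl_s=600):
    """Fraction of resolvers expected to have picked up the new DNS answer
    elapsed_s seconds after the change, assuming cached-record ages are
    uniformly distributed in [0, ttl_s)."""
    if ttl_s <= 0:
        return 1.0
    return min(max(elapsed_s / ttl_s, 0.0), 1.0)

for t in (0, 150, 300, 600, 900):
    print(t, shifted_fraction(t))  # ramps linearly, flat at 1.0 after one TTL
```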
[13:40:52] (03CR) 10BBlack: varnish: misspass limiter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/241643 (owner: 10BBlack) [13:41:40] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer [13:43:30] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [13:43:57] (03CR) 10Mark Bergsma: varnish: misspass limiter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/241643 (owner: 10BBlack) [13:46:46] (03CR) 10BBlack: varnish: misspass limiter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/241643 (owner: 10BBlack) [13:47:57] bblack: i figured, but I think if we're gonna have the same subs concatenated _within the same file_ as well it's gonna get reallly confusing ;) [13:48:28] fair [13:48:54] if you do that anyway, at least put a very clear comment [13:49:00] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer [13:49:19] in soviet varnish, concatened subs call *you* [13:49:36] lol [13:52:04] (03CR) 10JanZerebecki: [C: 031] Explicitly set wmgUseWikibaseQualityExternalValidation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241649 (owner: 10Hoo man) [13:53:11] (03PS2) 10BBlack: varnish: misspass limiter [puppet] - 10https://gerrit.wikimedia.org/r/241643 [13:53:47] (03PS3) 10BBlack: varnish: misspass limiter [puppet] - 10https://gerrit.wikimedia.org/r/241643 [13:56:20] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [13:56:46] jouncebot: next [13:56:57] hmmm :P [13:57:07] (03CR) 10JanZerebecki: [C: 031] Add "nod" (Northern Thai) to wmgExtraLanguageNames for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241598 (owner: 10Hoo man) [13:57:52] it's not on the calendar page, probably greg though I'd do that and I thought the opposite [13:57:55] but whatever :) [14:00:03] (03CR) 10Faidon Liambotis: [C: 032] Set eqiad's admin_state to "down" [dns] - 10https://gerrit.wikimedia.org/r/241647 (owner: 10Faidon Liambotis) [14:00:22] !log running 
failover eqiad->codfw test for all frontend traffic [14:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:01:20] now i miss my old reqstats [14:02:09] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Text+caches+codfw&m=cpu_report&s=by+name&mc=2&g=network_report [14:02:35] hockey stick, someone start a conspiracy [14:02:38] heh [14:08:40] (03CR) 10Anomie: "The character being replaced is an underscore (_), not a hyphen (-)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241579 (owner: 10Base) [14:12:19] paravoid: fixed up: datasets crontab owned by wrong user (why wouldn't puppet fix that? but anyways). ran a check across the cluster, no other such cases [14:12:38] awesome, thank you [14:12:48] thanks for the report [14:16:59] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer [14:17:13] (03CR) 10ArielGlenn: [C: 032] for wmf reimage script, don't rotate saltmaster aes key on minion key deletion [puppet] - 10https://gerrit.wikimedia.org/r/238164 (owner: 10ArielGlenn) [14:18:22] traffic shift looks pretty stable now [14:18:46] 10 minutes for the ramp over the DNS TTLs, and it's been nearly 10 mins since that point and relatively-flat [14:18:49] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [14:18:52] yup [14:19:21] 6operations, 7Database: defragment db1015, db1035 and db1027 - https://phabricator.wikimedia.org/T110504#1680928 (10jcrespo) [14:19:22] paravoid: we had some spike of 500 errors according to griffin https://grafana.wikimedia.org/#/dashboard/db/varnish-http-errors [14:19:35] griffin? [14:19:51] grr [14:19:52] hashar: you mean the stuff back around 08:30 UTC? [14:19:56] apple auto corrects [14:20:12] bblack: ah yeah sorry. Forgot to look at the time scale. 
Forget me [14:21:42] 6operations, 7Database: New hardware for production core mysql cluster - https://phabricator.wikimedia.org/T106847#1680932 (10jcrespo) p:5Normal>3High s3 is in critical state (see T110504), send procurement soon. [14:21:51] don't see much shift on perf.wm.o from codfw, I guess the latency diff between the two for most clients it's really significant [14:22:05] maybe for logged-in we'd see a bit [14:24:25] \o/ [14:24:52] ✓ [14:24:59] haha [14:25:06] {{done}} [14:28:19] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer [14:29:12] well that was boring [14:29:20] 6operations, 10Salt, 3Discovery-Maps-Sprint: Kartotherian git deploy service restart failed with perm error - https://phabricator.wikimedia.org/T112707#1680951 (10ArielGlenn) [14:30:59] good job guys :) [14:32:18] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [14:35:46] (03PS1) 10ArielGlenn: add fluorine to dataset rsync clients (was missing) [puppet] - 10https://gerrit.wikimedia.org/r/241653 [14:37:05] (03CR) 10ArielGlenn: [C: 032] add fluorine to dataset rsync clients (was missing) [puppet] - 10https://gerrit.wikimedia.org/r/241653 (owner: 10ArielGlenn) [14:37:49] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer [14:40:29] mark: something like that https://grafana.wikimedia.org/#/dashboard/db/varnish-traffic ? [14:40:57] probably inaccurate, that graph the diamond rx_bit/tx_bit from any eth cards of the cp* boxes [14:40:58] bblack: ^^ [14:41:30] maybe they can be stacked somehow [14:43:57] hashar: well it's not really an accurate view of what you'd think it is, to look at raw net traffic on cp* [14:44:15] we still look at it, but you have to remember that that traffic is in both directions at each site, and the tiering, etc [14:44:52] e.g. 
mbps on the interface of an eqiad cache includes both the direct eqiad user traffic to that cache, and also the traffic that esams cache misses backend into eqiad, and also the cache's traffic back inside to the applayer. [14:45:37] yup [14:45:42] yeah that is very rough [14:46:43] (03CR) 10Eevans: "(still )LGTM overall, insofar as I'm a judge of the Puppet-ness(, but a couple questions in-line)." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [14:57:54] 6operations, 10Salt: check usage of salt-key delete everywhere - https://phabricator.wikimedia.org/T112534#1681034 (10ArielGlenn) 5Open>3Resolved changeset merged, this is done. [14:59:13] (03PS1) 10Faidon Liambotis: Remove role mail::mx from polonium/lead [puppet] - 10https://gerrit.wikimedia.org/r/241657 [15:01:08] 6operations, 10procurement: Return polonium/lead to spares - https://phabricator.wikimedia.org/T113962#1681038 (10faidon) 3NEW [15:01:28] (03CR) 10Faidon Liambotis: [C: 032] Remove role mail::mx from polonium/lead [puppet] - 10https://gerrit.wikimedia.org/r/241657 (owner: 10Faidon Liambotis) [15:02:28] PROBLEM - puppet last run on lead is CRITICAL: CRITICAL: Puppet last ran 5 days ago [15:04:19] RECOVERY - puppet last run on lead is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:04:30] jouncebot, next [15:04:30] In 4 hour(s) and 55 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150928T2000) [15:04:38] No SWAT? [15:04:43] I was just wondering about that [15:04:50] Oh, well, we're 4 minutes late for SWAT. [15:04:54] It's scheduled but jouncebot didn't announce it [15:04:54] Is there nothing in it? [15:04:57] Ah. [15:05:00] PROBLEM - puppet last run on polonium is CRITICAL: CRITICAL: Puppet last ran 5 days ago [15:05:13] Is it broken by Krinkle's messing with the page? [15:05:35] hoo, around? 
[15:05:59] Krenair: Yes [15:06:07] Do you want to SWAT? [15:07:28] PROBLEM - Exim SMTP on polonium is CRITICAL: Connection refused [15:07:29] PROBLEM - Exim SMTP on lead is CRITICAL: Connection refused [15:07:34] hoo, wmgUseWikibaseQualityExternalValidation is not actually being used anywhere? [15:07:43] ignore those, just nagios lagging behind [15:07:57] Krenair: Not yet, yes [15:08:48] RECOVERY - puppet last run on polonium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:08:56] So what's the point in adding it? [15:09:09] PROBLEM - spamassassin on polonium is CRITICAL: PROCS CRITICAL: 0 processes with args spamd [15:09:51] Krenair: We want to have it there in advance [15:09:58] why? [15:10:02] I can push it out myself, if you don't feel like it [15:10:02] why not [15:10:11] (03CR) 10BBlack: [C: 04-1] LVS: git-ssh service for phab backend (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/240600 (owner: 10Rush) [15:10:23] (03CR) 10Alex Monk: [C: 032] Explicitly set wmgUseWikibaseQualityExternalValidation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241649 (owner: 10Hoo man) [15:10:31] (03Merged) 10jenkins-bot: Explicitly set wmgUseWikibaseQualityExternalValidation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241649 (owner: 10Hoo man) [15:11:04] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/241649/ (duration: 00m 17s) [15:11:08] hoo, ^ [15:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:23] (03PS9) 10coren: webservicemonitor: some improvements [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/239377 (https://phabricator.wikimedia.org/T109362) [15:11:24] ugh git-ssh sounds horrible :( [15:11:28] Krenair: Thanks [15:11:31] why not just "git"? 
[15:11:38] PROBLEM - spamassassin on lead is CRITICAL: PROCS CRITICAL: 0 processes with args spamd [15:11:42] hoo, https://gerrit.wikimedia.org/r/#/c/241598/1/wmf-config/InitialiseSettings.php on my screen just shows ᨣᩴᩤᨾᩮᩥᩬᨦ [15:12:08] paravoid: because we might want https://git.wikimedia.org and ssh:git-ssh.wikimedia.org to be two different IPs? [15:12:15] why? [15:12:39] one's for ssh, one's for web view or whatever. are they the same software stack always? [15:12:57] no, but it's going over LVS anyway, right? [15:13:01] oh also, git-ssh will probably be single-datacenter to the primary, whereas the web view would be via misc-web [15:13:08] so there's a geoip difference there [15:13:22] (03PS3) 10Rush: LVS: git-ssh service for phab backend [puppet] - 10https://gerrit.wikimedia.org/r/240600 [15:13:30] Krenair: That's the name of the language in the language itself, yes [15:14:01] I have no love for git-ssh but git.wm is not avail atm, open to suggestion [15:14:07] unless we solve the traffic sec issues and have e.g. esams terminate git.wikimedia.org ssh by sending it directly to eqiad without an applayer proxy. [15:14:13] plan is to point git.wm to same IP when it is in my mind [15:14:22] hoo, maybe it's showing properly for you [15:14:27] bblack: how do you mean? [15:14:41] Krenair: Maybe check it out and diff it on the cli? [15:14:45] yeah [15:14:45] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [15:14:49] we don't currently have a good plan for an applayer port 22 proxy [15:15:19] if git.wikimedia.org for HTTPS goes into misc-web, which will be all-datacenters (when other blockers resolved) as a normal 2-tier cache... 
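[Editor's note: one reason the "applayer port 22 proxy" mentioned above is awkward is that a naive TCP forwarder hides the client's source IP from the backend sshd. The standard answer to that is HAProxy's PROXY protocol: the proxy prepends one line carrying the original connection 4-tuple before relaying bytes. A minimal sketch of the v1 text form; whether any of the software being discussed here would actually adopt it is left open, and the addresses are examples:]

```python
def proxy_v1_header(src_ip, src_port, dst_ip, dst_port, family="TCP4"):
    """Build a PROXY protocol v1 header line (the human-readable variant),
    sent by the proxy before any payload bytes."""
    return f"PROXY {family} {src_ip} {dst_ip} {src_port} {dst_port}\r\n".encode("ascii")

def parse_proxy_v1(data):
    """Split a PROXY v1 header off the front of a byte stream, returning
    the original connection 4-tuple and the remaining payload."""
    line, sep, rest = data.partition(b"\r\n")
    if not sep:
        raise ValueError("incomplete PROXY header")
    parts = line.decode("ascii").split(" ")
    if parts[0] != "PROXY" or len(parts) != 6:
        raise ValueError("not a PROXY v1 header")
    family, src_ip, dst_ip, src_port, dst_port = parts[1:]
    return {
        "family": family,
        "src": (src_ip, int(src_port)),
        "dst": (dst_ip, int(dst_port)),
    }, rest

# A backend-side wrapper would recover the real client address like so:
wire = proxy_v1_header("203.0.113.7", 50632, "10.2.1.30", 22) + b"SSH-2.0-client\r\n"
info, payload = parse_proxy_v1(wire)
print(info["src"])  # -> ('203.0.113.7', 50632)
print(payload)      # -> b'SSH-2.0-client\r\n'
```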
[15:15:53] (03PS1) 10coren: Disable NFS lookup cache on NFS client instances [puppet] - 10https://gerrit.wikimedia.org/r/241663 (https://phabricator.wikimedia.org/T106170) [15:16:02] then that implies if port 22 was also on the git.wm.o IP/hostname, how does lvs300x route that traffic directly to the phab ssh server that only lives in eqiad or codfw? [15:16:28] unless we go deploy a generic TCP connection forwarded on cp machines I guess [15:16:30] (03PS4) 10Rush: LVS: git-ssh service for phab backend [puppet] - 10https://gerrit.wikimedia.org/r/240600 [15:16:31] well [15:16:38] 1) we could do that [15:16:39] *forwarder, but then we lose the client source IP [15:16:45] 2) we could also explore non-LVS-DR too [15:16:48] unless we have a real proxy solution [15:16:52] hoo, [15:16:58] LVS-TUN maybe [15:17:22] that's what we need, more new work to do on pybal and LVS :) [15:17:25] but 0) do we really need git.wm.org on misc-web? :) how are we going to invalidate commit logs etc.? [15:17:51] we really want everything on misc-web that doesn't have a good reason not to be there for redundancy or outage-handling or whatever [15:18:00] this does seem to have a good reason ;) [15:18:05] because managing 40 different TLS termination softwares and security patches and whatnot sucks [15:18:44] Krenair: let me check [15:18:52] (03CR) 10ArielGlenn: "not ready for merge yet, a bug to be worked out" [dumps/html/deploy] - 10https://gerrit.wikimedia.org/r/204964 (https://phabricator.wikimedia.org/T94457) (owner: 10GWicke) [15:19:20] (invalidation isn't generally an issue, right? we set it to bypass if the software doesn't give CC headers. hopefully in good cases they can send CC headers that most content is uncacheable, but e.g. 
images/css is) [15:19:42] it's not so much about the caches as just the public termination being standardized [15:21:27] (03CR) 1020after4: [C: 031] Fix mkdir_p relative paths [tools/scap] - 10https://gerrit.wikimedia.org/r/241576 (owner: 10Thcipriani) [15:21:53] https://wikitech.wikimedia.org/wiki/Httpsless_domains [15:21:58] https://wikitech.wikimedia.org/wiki/HTTPS/domains [15:22:14] 6operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash: Set up a service IP for logstash - https://phabricator.wikimedia.org/T113104#1681098 (10GWicke) @bd808, the source hash LVS method would assign the backend by source IP, so different fragments from the same producer would end up at the same backend. [15:22:22] ^ basically I hate that list of things to keep track of being bigger than it needs to be, that's why I like defaulting to misc-web unless there's a really strong reason [15:22:40] (03CR) 10ArielGlenn: "I need to do some careful testing. See https://phabricator.wikimedia.org/T113964" [puppet] - 10https://gerrit.wikimedia.org/r/219134 (owner: 10ArielGlenn) [15:22:44] Krenair: Looks ok [15:23:03] yeah ok, that's fair, but it kinda sucks to make architectural details like those visible to end-users [15:23:11] but ok, I don't feel that strongly about it [15:23:31] I don't think it's unreasonable to have two hostnames here for git-web and git-ssh, and that doesn't necessarily reflect what we're doing internally. [15:23:38] we could map them to the same IP later if it makes sense [15:24:45] (03PS2) 10Alex Monk: Add "nod" (Northern Thai) to wmgExtraLanguageNames for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241598 (https://phabricator.wikimedia.org/T93880) (owner: 10Hoo man) [15:25:04] 6operations: How to page when a host is down? 
- https://phabricator.wikimedia.org/T113834#1681111 (10Andrew) > I think as long as you do _not_ define any service dependencies [http://docs.icinga.org/latest > /en/dependencies.html] nothing will be suppressed by another check. So, if the host is down, naturally al... [15:25:43] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: assess impact of many cassandra seed nodes with multi instance - https://phabricator.wikimedia.org/T113939#1681112 (10GWicke) I don't think the seed / non-seed distinction matters beyond bootstrap. In normal operation, all that should matter... [15:25:45] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer [15:27:23] (03CR) 10Alex Monk: [C: 032] Add "nod" (Northern Thai) to wmgExtraLanguageNames for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241598 (https://phabricator.wikimedia.org/T93880) (owner: 10Hoo man) [15:27:30] (03Merged) 10jenkins-bot: Add "nod" (Northern Thai) to wmgExtraLanguageNames for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241598 (https://phabricator.wikimedia.org/T93880) (owner: 10Hoo man) [15:28:24] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/241598/ (duration: 00m 17s) [15:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:29:10] ottomata: is the kafka debian repo being maintained? I have a question + a patch to it if you want. [15:29:44] simonft: it is afaik [15:30:18] hoo, ^ [15:30:39] Krenair: Thanks, will test [15:32:17] Works: https://www.wikidata.org/w/index.php?title=Q4115189&type=revision&diff=254114308&oldid=254112903 [15:36:17] paravoid: is there a place I could offer patches? [15:36:41] are you familiar with gerrit?
[15:37:20] paravoid: not wikimedia's [15:38:27] there are instructions about our particular workflow: https://wikitech.wikimedia.org/wiki/Help:Git#Set_up_your_ssh_key_in_Gerrit [15:38:37] thanks [15:38:55] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [15:41:13] simonft: you might be interested in https://bugs.debian.org/786460 as well [15:41:30] simonft: someone has claimed interest in packaging it properly for Debian [15:44:20] paravoid: yes, I am, thanks. That's going to be an interesting project though. [15:44:27] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer [15:44:33] what is your interest, if I may ask? [15:45:05] RECOVERY - dhclient process on pybal-test2003 is OK: PROCS OK: 0 processes with command name dhclient [15:45:51] paravoid: planning on running it internally. I'd prefer a debian package, even if that ends up being a bit more work. [15:46:01] right [15:46:08] we are running these packages in production [15:46:29] paravoid: on ubuntu or debian? [15:46:31] but they're a bit hacky, we're using a hand-crafted Makefile iirc [15:46:42] used to be ubuntu, now it's debian [15:46:52] packages probably still work for both [15:47:09] ah. I'm guessing you skipped 14.04, as there seem to be some issues building and installing it there.
[15:47:36] (03PS2) 10Alex Monk: Replace underscore with space in uawikimedia's wgSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241579 (owner: 10Base) [15:47:46] correct [15:48:21] we used to run them with precise (12.04) [15:48:33] (03CR) 10Alex Monk: [C: 032] Replace underscore with space in uawikimedia's wgSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241579 (owner: 10Base) [15:48:54] (03Merged) 10jenkins-bot: Replace underscore with space in uawikimedia's wgSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241579 (owner: 10Base) [15:49:35] PROBLEM - Host pybal-test2003 is DOWN: PING CRITICAL - Packet loss = 100% [15:50:07] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/241579/ (duration: 00m 17s) [15:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:50:55] RECOVERY - Host pybal-test2003 is UP: PING OK - Packet loss = 0%, RTA = 34.65 ms [15:51:14] RECOVERY - salt-minion processes on pybal-test2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:51:14] RECOVERY - configured eth on pybal-test2003 is OK: OK - interfaces up [15:51:15] RECOVERY - DPKG on pybal-test2003 is OK: All packages OK [15:51:34] RECOVERY - Disk space on pybal-test2003 is OK: DISK OK [15:52:05] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [15:52:05] RECOVERY - RAID on pybal-test2003 is OK: OK: no RAID installed [15:57:27] (03PS2) 10ArielGlenn: dumps: move many classes out of worker.py into separate modules [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/241626 [16:35:35] (03PS2) 10Thcipriani: Fix mkdir_p relative paths [tools/scap] - 10https://gerrit.wikimedia.org/r/241576 [16:35:38] (03PS1) 10Thcipriani: Checkout revs to repo-cache, link to repo [tools/scap] - 10https://gerrit.wikimedia.org/r/241684 (https://phabricator.wikimedia.org/T113107) [16:44:58] (03PS3) 
10Thcipriani: Add config deployment [tools/scap] - 10https://gerrit.wikimedia.org/r/240292 (https://phabricator.wikimedia.org/T109512) [16:45:21] (03CR) 10jenkins-bot: [V: 04-1] Add config deployment [tools/scap] - 10https://gerrit.wikimedia.org/r/240292 (https://phabricator.wikimedia.org/T109512) (owner: 10Thcipriani) [16:46:46] (03PS4) 10Thcipriani: Add config deployment [tools/scap] - 10https://gerrit.wikimedia.org/r/240292 (https://phabricator.wikimedia.org/T109512) [16:55:38] 10Ops-Access-Requests, 6operations: RESTBase Admin access on aqs1001, aqs1002, and aqs1003 for Joseph and Dan - https://phabricator.wikimedia.org/T113416#1681613 (10RobH) a:3RobH [16:57:59] 6operations, 3labs-sprint-116: Quoted booleans probably stopping a lot of pages - https://phabricator.wikimedia.org/T113781#1681631 (10Andrew) [16:58:15] robh: https://gerrit.wikimedia.org/r/#/c/241562/ mergey? :) [16:58:51] (03Abandoned) 10Yuvipanda: ssh: Disable root logins on prod [puppet] - 10https://gerrit.wikimedia.org/r/160628 (owner: 10Matanya) [17:11:58] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: assess impact of many cassandra seed nodes with multi instance - https://phabricator.wikimedia.org/T113939#1681687 (10Eevans) I did some due diligence on this, and it's not //quite// the case that seeds are only used for bootstrapping. Ther... [17:15:13] (03CR) 10Ori.livneh: "Looks good. Some tiny quibbles inline. For initial deployment, though, instead of issuing the error, why don't we simply std.log() it? Thi" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/241643 (owner: 10BBlack) [17:16:30] (03CR) 10Dduvall: "Looks good barring the commit message—says it fixes `mkdir_p` but it actually fixes `move_symlink`." 
[tools/scap] - 10https://gerrit.wikimedia.org/r/241576 (owner: 10Thcipriani) [17:21:46] Ignore 51 lag, it is depooled [17:21:50] will ack now [17:24:48] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests: New server: labdns1001 - https://phabricator.wikimedia.org/T106147#1681821 (10Andrew) a:5Andrew>3mark Reassigned to Mark for approval of the misc-server allocation [17:30:46] paravoid: it looks like the 0.8.2 tag was never pushed to https://git.wikimedia.org/summary/operations%2Fdebs%2Fkafka [17:31:14] RECOVERY - Disk space on labstore1002 is OK: DISK OK [17:31:35] JohnFLewis: no phab task link? [17:32:05] robh: well not really necessary [17:32:24] (03CR) 10RobH: "I wouldn't merge this until there is a link to the original access request." [puppet] - 10https://gerrit.wikimedia.org/r/241562 (owner: 10John F. Lewis) [17:32:58] robh: also if you looked right above your comment, you'd see me link to it :) [17:33:10] ahh [17:33:21] well the original gerrit commit with the ticket [17:33:26] yea no worries [17:33:46] it's the old 'things are simple with bastions' issue :) [17:33:54] *aren't simple [17:34:03] (03PS2) 10RobH: admin: add asherman to bastiononly [puppet] - 10https://gerrit.wikimedia.org/r/241562 (owner: 10John F. Lewis) [17:34:25] yea its legit just appending the task into commit message and i'll merge [17:34:26] thx for the patchset [17:34:32] okay [17:35:07] (03CR) 10RobH: [C: 032] "I said I wouldn't merge without the info, when the info had already been provided in the review messages ;]" [puppet] - 10https://gerrit.wikimedia.org/r/241562 (owner: 10John F. 
Lewis) [17:35:31] (03PS4) 10coren: Populate labsdb1004 with mariadb [puppet] - 10https://gerrit.wikimedia.org/r/218874 (https://phabricator.wikimedia.org/T88718) (owner: 10Jcrespo) [17:35:46] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests: New server: labdns1001 - https://phabricator.wikimedia.org/T106147#1681882 (10mark) The assignment of a misc box for this is approved for now - but I think we shouldn't just have DNS on it in the future. As discussed, LDAP would also be fine... [17:36:03] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests: New server: labdns1001 - https://phabricator.wikimedia.org/T106147#1681883 (10mark) a:5mark>3None [17:36:45] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:46] (03CR) 10coren: [C: 031] "I like the separate roles best. Rebased to head." [puppet] - 10https://gerrit.wikimedia.org/r/218874 (https://phabricator.wikimedia.org/T88718) (owner: 10Jcrespo) [17:37:13] (03PS3) 10RobH: admin: add asherman to bastiononly [puppet] - 10https://gerrit.wikimedia.org/r/241562 (owner: 10John F. Lewis) [17:37:40] (03CR) 10RobH: [V: 032] admin: add asherman to bastiononly [puppet] - 10https://gerrit.wikimedia.org/r/241562 (owner: 10John F. Lewis) [17:38:25] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61464 bytes in 0.094 second response time [17:41:33] akosiaris do you wanna +2 this or is there more to do: https://gerrit.wikimedia.org/r/#/c/231574/ [17:44:06] (03PS9) 10Milimetric: Add Analytics Query Service role [puppet] - 10https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) [17:44:40] (03PS1) 10coren: Labs: remove NFS from toolserver-legacy [puppet] - 10https://gerrit.wikimedia.org/r/241704 (https://phabricator.wikimedia.org/T104256) [17:52:12] (03CR) 10coren: [C: 032] "Just data." 
[puppet] - 10https://gerrit.wikimedia.org/r/241704 (https://phabricator.wikimedia.org/T104256) (owner: 10coren) [17:54:01] (03CR) 10Jcrespo: [C: 031] "Ok with it. I will not deploy today, but it can be done without danger at any time." [puppet] - 10https://gerrit.wikimedia.org/r/218874 (https://phabricator.wikimedia.org/T88718) (owner: 10Jcrespo) [17:56:34] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [18:01:52] (03CR) 10BBlack: "The rate-exceeded block runs on every match, not just the first time the rate gets blown. It would spam logs on the order of the DoS rate" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/241643 (owner: 10BBlack) [18:02:00] (03PS4) 10BBlack: varnish: misspass limiter [puppet] - 10https://gerrit.wikimedia.org/r/241643 [18:11:41] akosiaris, i think i will need to create another service "tilerator-ui" -- identical to tilerator, pointing to the same git repo, but with different config. It will run on all maps backends and allow administration. [18:14:22] Hi bblack! Do you please have a second to tell me whether Special:BannerLoader would automatically hit PHP when called for logged-in users via an Ajax request? Also, if it is, how feasible would it be to make them hit the cache instead just like anons? [18:14:43] s/if it is/if it does/ [18:14:55] or if it would, I guess [18:15:07] * AndyRussG hides from lurking copy editors [18:17:08] yes it would hit PHP [18:17:23] this sounds like a complicated question with complicated design-change answers :) [18:17:42] I'm not even sure if/how/when that's cached for anons tbh [18:21:09] bblack: hmmm OK right... It could certainly be cached for some amount of time, say 5 mins, similar to load.php. Would that be hard? [18:22:41] honestly, I don't know. Do you even know whether the anon version is cached? I imagine for various reasons it's not. 
[18:22:47] make a phab task :) [18:24:56] 6operations, 10Wikimedia-General-or-Unknown, 7Database: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1682103 (10Halfak) I couldn't find the revision with my parsing utility, so I also tried the grep strategy. ``` (3.4)[halfak@stat10... [18:25:43] 6operations, 10Datasets-General-or-Unknown, 7HHVM: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#1682104 (10ArielGlenn) upgraded snapshot1001 to trusty. puppet runs ok with one exception, unable to finish installation of mcelog because: root@snapshot1001:~# mcelog... [18:26:06] bblack: Hmm... IIRC now it may be cached... rrrg yeah I'll make task! thanks so much!!!! :) [18:28:14] (03PS1) 10Tim Landscheidt: Tools: Install flex on bastions [puppet] - 10https://gerrit.wikimedia.org/r/241727 (https://phabricator.wikimedia.org/T114003) [18:28:29] 6operations, 10Datasets-General-or-Unknown, 7HHVM: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#1682127 (10ArielGlenn) with certain AMD cpus. [18:36:01] 6operations, 10Datasets-General-or-Unknown, 5Patch-For-Review: Add App Guidelines on Dumps Page - https://phabricator.wikimedia.org/T110742#1682157 (10VBaranetsky) Hi @ArielGlenn - yes, I reviewed the day the change was made. It is exactly what was requested. Thank you so much for your help with this matter.... 
[18:40:44] (03PS1) 10Ori.livneh: Don't install mcelog on machines with AMD processors [puppet] - 10https://gerrit.wikimedia.org/r/241736 (https://phabricator.wikimedia.org/T94277) [18:40:45] apergos: ^ [18:41:56] I was thinking we would be a bit nicer about it, just exclude those with certain families of processors [18:42:27] if we don't rely on the log for anything, then I don't care at all [18:43:36] (03PS2) 10Ori.livneh: Don't install mcelog on machines with AMD processors [puppet] - 10https://gerrit.wikimedia.org/r/241736 (https://phabricator.wikimedia.org/T94277) [18:43:57] apergos: unfortunately the faq doesn't specify which processors apply [18:44:07] which processors aren't supported, i mean [18:44:29] no it doesn't, give me a couple minutes to see what we have in prod and I'll say something on the changeset [18:45:38] i think it's better this way, to be honest. we don't rely on it for anything, and the faq says "newer amd processors" [18:45:46] (03CR) 10Gilles: [C: 031] Don't install mcelog on machines with AMD processors [puppet] - 10https://gerrit.wikimedia.org/r/241736 (https://phabricator.wikimedia.org/T94277) (owner: 10Ori.livneh) [18:48:15] (03CR) 10BBlack: "Do we have AMD machines or plans for them? also there's some unrelated stuff in here." [puppet] - 10https://gerrit.wikimedia.org/r/241736 (https://phabricator.wikimedia.org/T94277) (owner: 10Ori.livneh) [18:48:36] bblack: see https://phabricator.wikimedia.org/T94277#1682104 [18:50:43] (03CR) 10Dzahn: [C: 031] Filter example ircnick from patch owners list [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/241254 (owner: 10Alex Monk) [18:50:53] apergos: https://github.com/andikleen/mcelog/blob/master/mcelog.c#L478-L547 [18:51:18] (03PS1) 10John F.
Lewis: admin: add neilpquinn-wmf to analytic-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/241741 (https://phabricator.wikimedia.org/T113533) [18:52:03] family >= 16 [18:52:03] robh: ^^ :) [18:52:05] that would get it [18:52:29] apergos: also fyi: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=738927 https://github.com/andikleen/mcelog/commit/5709c265d89b4ea06998ca21f148642ca1626939 , [18:52:32] if we could just keep it from getting installed on those, I'd call it good enough [18:52:49] (03CR) 10Gergő Tisza: "That would be a lot simpler indeed, but it would automatically refuse all anonymous access, right? Sentry allows users who have access to " [puppet] - 10https://gerrit.wikimedia.org/r/241578 (owner: 10Gergő Tisza) [18:53:32] yeah I saw the --is_cpu-supported command line option earlier [18:53:52] it's not available yet [18:54:05] nope but it was in their git tree [18:55:18] never mind, you can axe it on all amds, we only have one host in the cluster with them :-D [18:55:59] hah [18:56:06] JohnFLewis: i haven't gotten to access requests yet [18:56:15] just finished a different backlog. [18:56:18] (03CR) 10ArielGlenn: [C: 031] "there's only one host in production with AMD processors so this seems a fine choice." [puppet] - 10https://gerrit.wikimedia.org/r/241736 (https://phabricator.wikimedia.org/T94277) (owner: 10Ori.livneh) [18:56:27] (03CR) 10Ori.livneh: [C: 032] Don't install mcelog on machines with AMD processors [puppet] - 10https://gerrit.wikimedia.org/r/241736 (https://phabricator.wikimedia.org/T94277) (owner: 10Ori.livneh) [18:56:40] robh: just a nice poke to say 'when you get to it; merge this, saves time!' :) [18:57:24] (03CR) 10Ori.livneh: [C: 031] varnish: misspass limiter [puppet] - 10https://gerrit.wikimedia.org/r/241643 (owner: 10BBlack) [18:57:47] Consistent 503 errors on phabricator here [18:57:51] phabricator really slow [18:57:59] and now it hangs [18:58:19] twentyafterfour: greg-g: phab having issues?
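The "family >= 16" cutoff discussed above (from the mcelog.c check linked at 18:50:53) can be approximated from /proc/cpuinfo. A minimal sketch, assuming a Linux x86 host; this mirrors only the cutoff mentioned in the channel, not the exact per-family logic in upstream mcelog.c:

```shell
# Hedged sketch: detect the unsupported-AMD case that made puppet skip
# mcelog on snapshot1001. Reads vendor and family from /proc/cpuinfo.
vendor=$(awk -F': *' '/^vendor_id/ {print $2; exit}' /proc/cpuinfo)
family=$(awk -F': *' '/^cpu family/ {print $2; exit}' /proc/cpuinfo)
if [ "$vendor" = "AuthenticAMD" ] && [ "$family" -ge 16 ]; then
    echo "mcelog: unsupported AMD family $family"
else
    echo "mcelog: CPU appears supported (vendor=$vendor family=$family)"
fi
```

A puppet fact built on this check would allow excluding only affected hosts instead of all AMD machines, which is the "nicer" variant apergos floats above.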
bblack: btw, varnishrls will automatically graph 429 reqs [18:58:34] which host is it that we have in prod? [18:58:35] bblack: err, varnishstatsd [18:58:40] with AMD, I mean [18:58:40] iridium [18:58:49] why/how? [18:58:52] paravoid: snapshot1001 [18:59:05] chasemp: apparently .. [18:59:12] iridium == phab, snapshot1001 == amd [18:59:14] connects to iridium [18:59:19] it was fine a minute ago [18:59:21] it tries to [18:59:26] yeah, not loading for me [18:59:32] connecting to iridium isn't happening for me [18:59:34] I was asking about amd, sorry and thanks ori [18:59:40] can't connect, trying mgmt now [19:00:00] I'm on console and nothing is there [19:00:00] ahh [19:00:01] OOM [19:00:05] yea.... eww [19:00:11] chasemp: mgmt console? [19:00:14] right [19:00:17] PROBLEM - RAID on iridium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:00:25] ok, on console too, but stepping back [19:00:25] it needs a kick as it's OOMing into nonresponse for even ssh [19:00:28] PROBLEM - configured eth on iridium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:00:29] powercycle [19:00:35] yuck, indeed powercycle it pls chasemp [19:00:36] PROBLEM - dhclient process on iridium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:00:42] racadm serveraction powercycle [19:00:50] log it [19:01:01] the biggest flaw in linux is the damn oom killer thing. it's insane [19:01:03] That loadavg for iridium.. fail [19:01:06] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:01:10] why is it insane? [19:01:16] PROBLEM - salt-minion processes on iridium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:01:28] it usually kills important things like ssh while allowing the misbehaving process to continue to grow [19:01:35] that's not the case, no [19:01:37] this worked as a paging test [19:01:38] paravoid, because it tries to play AD&N instead of being intellectual?
:P [19:01:46] *AD&D [19:01:48] also not the case [19:02:01] and I doubt the effects we're seeing here the OOM killer [19:02:05] most likely it's swapdeath [19:02:05] mutante: it was all planned obviously to test paging :P [19:02:07] PROBLEM - PHD should be supervising processes on iridium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:02:12] due to the machine having lots of swap probably [19:02:13] PROBLEM - DPKG on iridium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:02:14] PROBLEM - Disk space on iridium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:02:24] !log powercycle iridium via console as it's unresponsive [19:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:02:40] JohnFLewis: yes, rob said we needed a test service :p [19:02:43] paravoid: has it changed recently? last time I looked into it, the behavior of OOM on linux was indeed insane for server workloads, it made more sense in a desktop environment maybe but that's not really linux's biggest use case [19:02:57] PROBLEM - puppet last run on iridium is CRITICAL: Timeout while attempting connection [19:02:57] PROBLEM - SSH on iridium is CRITICAL: Connection timed out [19:03:30] the problem with the OOM killer has traditionally been the opposite of what you're saying [19:03:35] the oom killer can suck, but it's also tunable [19:03:36] PROBLEM - Check size of conntrack table on iridium is CRITICAL: Timeout while attempting connection [19:03:56] that it kills the "misbehaving process" but usually that misbehaving process is the process you care about in a server [19:04:05] ssh is not amongst them, though [19:04:15] ssh is oom-adjusted (via oom_adj) to not be a likely target [19:04:24] and yeah if we have large amounts of swap, oomkill doesn't happen till swap is exhausted too, and usually swapping effectively kills the machine anyways [19:04:26] RECOVERY - configured eth on iridium is OK: OK - interfaces up [19:04:27] 
RECOVERY - dhclient process on iridium is OK: PROCS OK: 0 processes with command name dhclient [19:04:43] I wasn't saying oom was killing ssh, just that OOM was in play and ssh was unresponsive [19:04:46] RECOVERY - puppet last run on iridium is OK: OK: Puppet is currently enabled, last run 10 minutes ago with 0 failures [19:04:46] RECOVERY - SSH on iridium is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [19:04:48] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 21153 bytes in 0.127 second response time [19:05:17] RECOVERY - salt-minion processes on iridium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:05:17] RECOVERY - Check size of conntrack table on iridium is OK: OK: nf_conntrack is 0 % full [19:05:21] chasemp: my last experience with all this was in 2012-2013 so I'm a bit dated [19:05:33] oom_adj exists for many more years before that [19:05:35] I actually remember your last OOM hoorah :) [19:05:45] hahah [19:05:53] paravoid: it's the first I've heard of this oom_adj [19:05:56] RECOVERY - DPKG on iridium is OK: All packages OK [19:05:57] RECOVERY - Disk space on iridium is OK: DISK OK [19:06:08] RECOVERY - RAID on iridium is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [19:06:14] openssh (1:4.7p1-6) unstable; urgency=low [19:06:14] * Disable the Linux kernel's OOM-killer for the sshd parent; tweak [19:06:17] SSHD_OOM_ADJUST in /etc/default/ssh to change this (closes: #341767). [19:06:20] -- Colin Watson Sun, 30 Mar 2008 21:14:12 +0100 [19:06:25] phab oom [19:06:25] https://phabricator.wikimedia.org/P2107 [19:06:42] oom_adj = /proc/$pid/oom_adj [19:07:07] so something was hitting a very expensive php url? 
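The oom_adj mechanism paravoid describes above can be inspected per process. A minimal sketch; note modern kernels expose the knob as oom_score_adj (range -1000..1000, with -1000 meaning "never kill"), which supersedes the older oom_adj interface (-17..15) that the 2008 openssh changelog refers to:

```shell
# Read the OOM-killer adjustment for the current process. sshd sets a
# strongly negative value for itself, which is why it is an unlikely
# OOM-kill target even when a runaway process exhausts memory.
score=$(cat /proc/self/oom_score_adj)
echo "current oom_score_adj: $score"

# Protecting some other daemon by hand needs root, e.g. (hypothetical pid):
#   echo -1000 > /proc/1234/oom_score_adj
```

This is orthogonal to the swapdeath scenario also discussed here: with lots of swap configured, the machine can become unresponsive long before the OOM killer fires at all.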
[19:07:32] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=iridium.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1443467222&g=cpu_report&z=large&c=Miscellaneous%20eqiad [19:07:37] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=iridium.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1443467222&g=mem_report&z=large&c=Miscellaneous%20eqiad [19:07:41] I'd say so, yes [19:08:15] 6operations, 10Datasets-General-or-Unknown, 7HHVM, 5Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#1682248 (10ArielGlenn) Strictly speaking we only need to exclude from hosts with AMD processor family >= 16, but since there's only one host on the... [19:08:50] that's a rather large periodic spike [19:10:24] sigh, our access logs are useless [19:10:34] they are logging remote IP instead of X-F-F [19:10:49] http://ganglia.wikimedia.org/latest/graph.php?r=4hr&z=xlarge&h=iridium.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=load_report&c=Miscellaneous+eqiad [19:11:27] (03CR) 10Dzahn: "@paravoid see comments above, can't just use the role because iron is different. can i merge this as is and then follow-up with separating" [puppet] - 10https://gerrit.wikimedia.org/r/239023 (owner: 10Dzahn) [19:11:27] paravoid: there was an upstream bug that prevented us fixing that but I think it's unblocked now in phabricator [19:11:39] an upstream bug that what? [19:12:07] and which upstream?? [19:12:22] for phabricator's logs to support X-F-F [19:12:28] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 for etonkovidova - https://phabricator.wikimedia.org/T113680#1682273 (10RobH) @Elena, We'll need a few more things from you to make this happen. All of the requirements are outlined on https://wikitech.wikimedia.org/wiki/Requesting_shell_acces...
[19:12:33] I was talking about the Apache logs [19:12:38] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 for etonkovidova - https://phabricator.wikimedia.org/T113680#1682277 (10RobH) a:3Etonkovidova [19:12:52] that bug and apache logging are != I think [19:12:54] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 for etonkovidova - https://phabricator.wikimedia.org/T113680#1673053 (10RobH) p:5Triage>3Normal [19:12:54] but there is a very simple solution that would fix both actually [19:12:54] https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/phabricator/templates/phabricator-default.conf.erb;2e8fe1d82771a64f4df1208a6bcc737d40b880c3$29 [19:12:59] mod-rpaf [19:13:07] ah, that's just an apache extension right? I thought we already enabled that in apache [19:13:12] yeah rpaf [19:13:27] yup [19:13:40] uhm, apparently phd starts as root on a fresh reboot? that's not good [19:13:46] it's an apache module that very early on changes the remote IP to X-F-F, from trusted sources [19:13:55] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 for sbisson - https://phabricator.wikimedia.org/T113676#1682285 (10RobH) a:3SBisson @SBisson, We'll need a few more things from you to make this happen. All of the requirements are outlined on https://wikitech.wikimedia.org/wiki/Requesting_s... [19:14:37] should be as simple as including apache::mod::rpaf [19:14:42] I seriously thought we were using that already on iridium, I remember something about it. 
Maybe the patch never got merged for some reason [19:14:48] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 for sbisson - https://phabricator.wikimedia.org/T113676#1682290 (10RobH) p:5Triage>3Normal [19:14:53] (03CR) 10BryanDavis: [C: 032] Filter example ircnick from patch owners list [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/241254 (owner: 10Alex Monk) [19:14:57] RECOVERY - PHD should be supervising processes on iridium is OK: PROCS OK: 6 processes with UID = 997 (phd) [19:15:03] I see on my side high QPS since 5PM [19:15:24] no [19:15:27] needs to include RPAFproxy_ips [19:15:38] I proposed rpaf and it turned into a political discussion about client IPs and storage [19:15:45] a long time ago and I never returned I guess [19:15:46] but phab never was a well-behaved child [19:15:53] twentyafterfour: ^ is probably what you remember [19:16:08] 6operations, 10Beta-Cluster, 7Shinken: Make the Shinken IRC alert bot use colors - https://phabricator.wikimedia.org/T113785#1682313 (10hashar) p:5Triage>3Normal [19:16:16] ori: yes, that was the "from trusted sources" above [19:16:27] what was political about it, chasemp? [19:16:37] (03Merged) 10jenkins-bot: Filter example ircnick from patch owners list [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/241254 (owner: 10Alex Monk) [19:17:00] I don't think we should enable mod_rpaf [19:17:03] we should probably universally apply rpaf on every site that is proxied by misc-web [19:17:07] why not? [19:17:16] large spike of connections at 6:56 [19:17:24] paravoid: logging client IPs in general considered harmful? [19:17:43] 6operations, 10Beta-Cluster, 10Traffic, 5Patch-For-Review: Upgrade beta-cluster caches to jessie - https://phabricator.wikimedia.org/T98758#1682316 (10hashar) @demon any progress on this? I guess you had more important duties. Should we pair on it #together [19:17:45] it depends on what problem you're trying to solve.
if it's a matter of not having real IPs in the log file, adding %{X-Forwarded-For}i to our LogFormats, that would be enough [19:17:46] I don't agree with that opinion but I've seen it expressed [19:18:01] that's not actually true [19:18:09] IPs are getting logged now (to the varnish logs) [19:18:17] ah [19:18:20] as long as we keep it under our data retention rules (or lower) it's fine [19:18:37] I had poked at libapache2-mod-qos when we had the last round of phab misbehaving clients [19:18:53] ori: rpaf fixes logs, but it also fixes the perception of IPs on a) the apache ACLs, b) the applications themselves which may not be as smart about XFF as mediawiki is [19:18:58] so the vast majority of phabricator requests come from wikibugs polling at an insanely fast interval, as far as I can tell [19:18:59] 10Ops-Access-Requests, 6operations: RESTBase Admin access on aqs1001, aqs1002, and aqs1003 for Joseph and Dan - https://phabricator.wikimedia.org/T113416#1682318 (10RobH) a:5RobH>3Milimetric The request isn't quite clear. First its stated that restarting of restbase is required, and then its requested to... 
[19:19:11] 10Ops-Access-Requests, 6operations: RESTBase Admin access on aqs1001, aqs1002, and aqs1003 for Joseph and Dan - https://phabricator.wikimedia.org/T113416#1682321 (10RobH) p:5Triage>3Normal [19:19:29] paravoid: if we want to solve that bigger problem, then sure [19:19:42] but we'd have to be very careful about ensuring that RPAFproxy_ips was accurate, and if it ever drifted from the actual set of IPs we'd have serious problems [19:20:04] %{X-Forwarded-For}i takes care of logging [19:20:05] a phabricator upstream bug was mentioned above, that's why I mentioned it [19:20:07] jouncebot: next [19:20:07] In 0 hour(s) and 39 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150928T2000) [19:20:19] paravoid: there is a phab upstream bug about client ip's showing in the user profile [19:20:21] that uses xff [19:20:27] but poorly atm it's unrelated to this [19:20:35] labnet1002 [19:20:40] i.e. you logged in from $ip and $ip2 [19:20:46] hm, that didn’t do what I expected [19:21:11] rpaf would solve that too as it wouldn't have to be sorted at the phab level [19:21:43] (03PS2) 10RobH: admin: add neilpquinn-wmf to analytic-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/241741 (https://phabricator.wikimedia.org/T113533) (owner: 10John F. Lewis) [19:21:44] ori: the mod_status you fixed wouldn't have been an issue with rpaf, for example [19:21:49] If that's the case then we should probably use rpaf, IMO [19:22:15] chasemp: any idea why phd started as root? 
[19:22:26] (03PS2) 10Dzahn: Modified redirects config concerning outreachwiki aliases [puppet] - 10https://gerrit.wikimedia.org/r/241564 (owner: 10Base) [19:22:30] so maybe we make a ticket to simultaneously enable rpaf and ensure logs associated follow our retention guidelines [19:22:36] and have the convo anew [19:22:48] sounds good :) [19:22:54] twentyafterfour: that was me and then me fixing it :) [19:23:26] chasemp: so it doesn't auto-start at all? [19:23:45] paravoid: yeah, i thought about that. the mod_status docs recommend mod_remoteip, though. how does mod_rpaf differ from mod_remoteip? [19:23:46] there is a umask bug noted in the manifest above the service definition [19:24:02] Is any of knsq15, knsq6 or pascal (in knams) not decommissioned? [19:24:14] I'm not familiar with mod_remoteip [19:24:54] ah! [19:25:01] remoteip is apache 2.4 [19:25:06] right [19:25:08] and is in mainline, not an external module [19:25:32] I did not know that [19:26:46] (03CR) 10RobH: [C: 032] "3 day wait passed without incident" [puppet] - 10https://gerrit.wikimedia.org/r/241741 (https://phabricator.wikimedia.org/T113533) (owner: 10John F. Lewis) [19:27:37] but yeah, as we move more and more stuff behind misc-web, we can't expect to be able to adjust all of our sites to do the right thing with XFF at the application level [19:27:43] e.g. we were talking about moving OTRS behind misc-web today [19:28:13] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1002 for Neil P. Quinn - https://phabricator.wikimedia.org/T113533#1682348 (10RobH) 5Open>3Resolved a:3RobH 3 day wait passed without objection and @JohnLewis did all the patchset work so I just merged.
[19:29:08] (03CR) 10Dzahn: [C: 031] Modified redirects config concerning outreachwiki aliases [puppet] - 10https://gerrit.wikimedia.org/r/241564 (owner: 10Base) [19:29:10] 6operations, 10ops-eqiad, 10Traffic, 10netops, 5Patch-For-Review: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1682352 (10BBlack) Ok we sorted out some physical issues on lvs1007-9 cabling. Current state is all 6 hosts installed/up/running. The two blocker tasks are sti... [19:29:39] considering apache 2.4.7 seems remoteip is the way of the future then cool [19:29:57] nod [19:30:06] sounds good then [19:33:07] PROBLEM - configured eth on lvs1009 is CRITICAL: eth3 reporting no carrier. [19:33:52] where exactly are our data retention guidelines :) I must be searching wrong thing on wikitech [19:34:08] PROBLEM - configured eth on lvs1008 is CRITICAL: eth3 reporting no carrier. [19:34:36] PROBLEM - configured eth on lvs1007 is CRITICAL: eth3 reporting no carrier. [19:35:03] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 for sbisson - https://phabricator.wikimedia.org/T113676#1682369 (10SBisson) I have read and signed the agreement and I think all required information has been provided. Please let me know if something is missing. [19:37:13] 6operations: audit all SSL certificates expiry on ops tracking gcal - https://phabricator.wikimedia.org/T112542#1682373 (10RobH) For the first steps, I've created https://docs.google.com/a/wikimedia.org/spreadsheets/d/1yT5rvoEEUHhNeJAQRVamr8ECqN3TLsMaO8N_At4Ki3I/edit?usp=sharing This lists off all the one-off c... 
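As context for the rpaf/remoteip thread above: with Apache >= 2.4, mod_remoteip can rewrite the apparent client IP from X-Forwarded-For for requests arriving from trusted proxies, while a LogFormat using %{X-Forwarded-For}i covers the logging-only case. A minimal sketch (the trusted proxy range below is a placeholder, not a real misc-web address):

```apache
# Sketch only: trust XFF from the caching proxies and rewrite the
# apparent client IP (fixes logs, ACLs, and app-level IP perception).
LoadModule remoteip_module modules/mod_remoteip.so
RemoteIPHeader X-Forwarded-For
RemoteIPTrustedProxy 10.64.0.0/16   # placeholder proxy range

# Logging-only alternative: record the header without rewriting %h.
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{X-Forwarded-For}i\"" combined_xff
```

Either approach keeps real client IPs out of app-level guesswork; only the first also fixes ACLs and what applications perceive, which is the distinction drawn in the discussion.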
[19:37:21] 6operations, 6Phabricator, 6Release-Engineering-Team: Enable mod_remoteip and ensure logs follow retention guidelines - https://phabricator.wikimedia.org/T114014#1682374 (10chasemp) 3NEW [19:37:25] twentyafterfour: https://phabricator.wikimedia.org/T91648 [19:38:54] 6operations, 6Phabricator, 6Release-Engineering-Team: Enable mod_remoteip and ensure logs follow retention guidelines - https://phabricator.wikimedia.org/T114014#1682382 (10chasemp) [19:38:58] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-116: New server: labdns1001 - https://phabricator.wikimedia.org/T106147#1682384 (10Andrew) [19:39:46] (03PS1) 10Dzahn: switch 'stopsurveillance' redirect to policy site [puppet] - 10https://gerrit.wikimedia.org/r/241746 (https://phabricator.wikimedia.org/T97341) [19:43:30] (03CR) 10John F. Lewis: [C: 031] switch 'stopsurveillance' redirect to policy site [puppet] - 10https://gerrit.wikimedia.org/r/241746 (https://phabricator.wikimedia.org/T97341) (owner: 10Dzahn) [19:43:57] o/ csteipp [19:44:34] 6operations: shutdown sodium after mailman has migrated to jessie VM - https://phabricator.wikimedia.org/T82698#1682407 (10Dzahn) >>! In T82698#1679052, @faidon wrote: > I believe we have a decom procedure somewhere. Let's follow it properly and wipe this box. Totally, it's just another ticket, T110142 [19:44:56] 6operations, 10Wikimedia-Mailing-lists: Upgrade to Mailman 3.0 - https://phabricator.wikimedia.org/T97492#1682418 (10Dzahn) [19:44:56] 6operations: mailman - replace lighttpd - https://phabricator.wikimedia.org/T84053#1682419 (10Dzahn) [19:44:58] 6operations: Get rid of all Ubuntu Lucid (10.04) installs - https://phabricator.wikimedia.org/T80945#1682421 (10Dzahn) [19:45:00] 6operations: shutdown sodium after mailman has migrated to jessie VM - https://phabricator.wikimedia.org/T82698#1682416 (10Dzahn) 5Open>3Resolved [19:51:06] (03CR) 10Dduvall: [C: 04-1] "Looks good for the most part. 
Per our IRC discussion:" (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/240292 (https://phabricator.wikimedia.org/T109512) (owner: 10Thcipriani) [19:53:29] (03PS4) 10Rush: Root out a long chain of quoted bools in nagios/icinga/nrpe [puppet] - 10https://gerrit.wikimedia.org/r/241246 (https://phabricator.wikimedia.org/T113783) (owner: 10Andrew Bogott) [19:53:48] (03CR) 10Andrew Bogott: "puppet compiler says no-op on:" [puppet] - 10https://gerrit.wikimedia.org/r/241246 (https://phabricator.wikimedia.org/T113783) (owner: 10Andrew Bogott) [19:56:05] (03CR) 10Gilles: [C: 031] beta: Remove commented out rules for www2.knams.wikimedia.org/stats [puppet] - 10https://gerrit.wikimedia.org/r/240919 (owner: 10Krinkle) [19:56:56] PROBLEM - puppet last run on analytics1027 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:57:17] PROBLEM - RAID on analytics1027 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:57:35] (03CR) 10Andrew Bogott: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/924/" [puppet] - 10https://gerrit.wikimedia.org/r/241246 (https://phabricator.wikimedia.org/T113783) (owner: 10Andrew Bogott) [19:57:51] 6operations, 6Labs, 10Labs-Infrastructure: install/setup labdns1001 - https://phabricator.wikimedia.org/T106584#1682446 (10RobH) [19:58:47] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 24 minutes ago with 0 failures [19:59:06] RECOVERY - RAID on analytics1027 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [20:00:04] gwicke cscott arlolra subbu mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150928T2000). 
[20:02:01] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-116: New server: labdns1001 - https://phabricator.wikimedia.org/T106147#1682465 (10RobH) I'm not sure why this is bare-metal to run labs dns in a public vlan (like holmium) Can't this be a ganeti vm? [20:05:14] hi halfak [20:05:33] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-116: New server: labdns1001 - https://phabricator.wikimedia.org/T106147#1682496 (10yuvipanda) Let's definitely not put any labs support things on ganeti - that'll definitely complicate something or the other in the future (mixi... [20:05:48] starting parsoid deploy [20:05:49] 6operations, 6Labs, 10Labs-Infrastructure: install/setup labservices1001 - https://phabricator.wikimedia.org/T106584#1682499 (10RobH) [20:06:41] (03CR) 10Rush: [C: 031] "I believe these are all correct contexts for this change" [puppet] - 10https://gerrit.wikimedia.org/r/241246 (https://phabricator.wikimedia.org/T113783) (owner: 10Andrew Bogott) [20:08:33] (03PS1) 10RobH: setting labservices1001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/241755 [20:10:54] (03CR) 10RobH: [C: 032] setting labservices1001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/241755 (owner: 10RobH) [20:11:13] (03PS3) 1020after4: Fix mkdir_p relative paths [tools/scap] - 10https://gerrit.wikimedia.org/r/241576 (owner: 10Thcipriani) [20:13:57] PROBLEM - Host cp2017 is DOWN: PING CRITICAL - Packet loss = 100% [20:14:14] uh [20:14:18] 2017 [20:14:26] bblack: ^ is that serving traffic? [20:15:18] should we depool it? [20:15:37] mutante: robh chasemp if y'all are around [20:15:50] (03CR) 10Andrew Bogott: [C: 032] Root out a long chain of quoted bools in nagios/icinga/nrpe [puppet] - 10https://gerrit.wikimedia.org/r/241246 (https://phabricator.wikimedia.org/T113783) (owner: 10Andrew Bogott) [20:15:53] I'm not sure if cp2017 is live? has it got any work being done on it? 
[20:15:59] (03CR) 1020after4: [C: 031] Checkout revs to repo-cache, link to repo [tools/scap] - 10https://gerrit.wikimedia.org/r/241684 (https://phabricator.wikimedia.org/T113107) (owner: 10Thcipriani) [20:16:52] ok I don't see it downtimed so I suppose it's down [20:16:58] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2017_v4, cp2017_v6 [20:17:17] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2017_v4, cp2017_v6 [20:17:36] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2017_v4, cp2017_v6 [20:17:37] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2017_v4, cp2017_v6 [20:17:37] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2017_v4, cp2017_v6 [20:17:47] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2017_v4, cp2017_v6 [20:17:49] (03PS4) 10Andrew Bogott: Logstash: Fixed a quoted boolean in a code comment. 
[puppet] - 10https://gerrit.wikimedia.org/r/241236 (https://phabricator.wikimedia.org/T113783) [20:17:56] ok, so 2017 is an upload cache in dalls [20:17:56] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2017_v4, cp2017_v6 [20:17:58] *dallas [20:18:05] let me depool it [20:18:27] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2017_v4, cp2017_v6 [20:18:27] https://wikitech.wikimedia.org/wiki/Depooling_servers heh, I wrote this only last week [20:18:28] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2017_v4, cp2017_v6 [20:18:46] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2017_v4, cp2017_v6 [20:18:46] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2017_v4, cp2017_v6 [20:18:46] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2017_v4, cp2017_v6 [20:18:48] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2017_v4, cp2017_v6 [20:18:55] yuvipanda: cool so it was already in maint? [20:19:17] Coren: no it wasn't downtimed or anything [20:19:20] err [20:19:22] chasemp: [20:19:26] (03CR) 10Andrew Bogott: [C: 032] Logstash: Fixed a quoted boolean in a code comment. [puppet] - 10https://gerrit.wikimedia.org/r/241236 (https://phabricator.wikimedia.org/T113783) (owner: 10Andrew Bogott) [20:19:42] chasemp: so I suppose I should depool it? [20:19:50] ah yuvipanda I read "ok I don't see it downtimed so I suppose it's down" as it was downtimed in icinga already [20:19:56] is it responsive at all? [20:20:08] 6operations, 10Datasets-General-or-Unknown, 7HHVM, 5Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#1682568 (10ArielGlenn) snapshot1004 reinstalled. pending are: hhvm upstream bug, snapshot1003 to be converted as soon as I know if we'll have a fix... 
[20:20:20] chasemp: can't ping it [20:20:21] I mean I assume down but maybe icinga is having issues (unlikely w/ ipsec failures associated) [20:20:26] I guess so then [20:21:02] to folks keeping track, i.e. _joe_: "title changed to 'compress.bzip2 not implemented in bz2 extension'" (upstream bug). heh [20:21:03] 6operations, 10MediaWiki-extensions-BounceHandler: Need an administrative front end for BounceHandler - https://phabricator.wikimedia.org/T114020#1682572 (1001tonythomas) 3NEW [20:21:32] chasemp: yeah, can't ping from palladium [20:21:41] chasemp: going to depool it now [20:22:48] !log depooled cp2017 since it's down [20:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:23:03] yuvipanda: it's still enabled in lvs [20:23:56] (03CR) 10Ottomata: [C: 031] hadoop: fix some lint issues [puppet] - 10https://gerrit.wikimedia.org/r/241315 (owner: 10Dzahn) [20:24:55] chasemp: oh right. [20:25:06] chasemp: how do I depool that? [20:25:11] need to also put that in the docs... [20:25:22] /srv/pybal-config? [20:25:25] right [20:25:38] I grep for the host and enabled = false [20:25:48] chasemp: i found the host, it's at true [20:25:49] watch out tho sometimes hanging file changes in that repo [20:25:54] let me switch it to False [20:25:54] simonft: yes please, patches welcome! [20:26:11] if you do submit, try to add me as a reviewer to those patches [20:26:22] ottomata: can you push the tags to the repo? [20:26:30] it looks like a couple are missing [20:26:32] ah, like 0.8.2.2? [20:26:34] !log depooled cp2017 from pybal config too [20:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:26:45] 0.8.2.1 as well I think [20:26:52] 0.8.2.1 isn't there? should be...hm. [20:26:54] looking [20:27:07] PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
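For reference on the pybal depool step above: the pool files under /srv/pybal-config list roughly one Python-style dict per backend host, and depooling is flipping its 'enabled' flag. A sketch of such an entry (the weight is illustrative, not the production value):

```python
# Sketch of a pybal pool file entry; depooling cp2017 means
# changing 'enabled' from True to False on its line.
{ 'host': 'cp2017.codfw.wmnet', 'weight': 10, 'enabled': False }
```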
[20:27:12] yuvipanda: looks good, pybal picks that up on 60s intervals I think [20:27:13] fyi [20:27:32] 6operations: cp2017 is down - https://phabricator.wikimedia.org/T114022#1682597 (10yuvipanda) 3NEW [20:27:33] Hm! i guess its not. yes, simonft i will push those [20:28:21] ACKNOWLEDGEMENT - Confd template for /etc/varnish/directors.backend.vcl on cp2017 is CRITICAL: Timeout while attempting connection Yuvi Panda https://phabricator.wikimedia.org/T114022 - The acknowledgement expires at: 2015-09-30 20:28:02. [20:28:21] ACKNOWLEDGEMENT - Confd template for /etc/varnish/directors.frontend.vcl on cp2017 is CRITICAL: Timeout while attempting connection Yuvi Panda https://phabricator.wikimedia.org/T114022 - The acknowledgement expires at: 2015-09-30 20:28:02. [20:28:21] ACKNOWLEDGEMENT - Confd vcl based reload on cp2017 is CRITICAL: Timeout while attempting connection Yuvi Panda https://phabricator.wikimedia.org/T114022 - The acknowledgement expires at: 2015-09-30 20:28:02. [20:28:21] ACKNOWLEDGEMENT - DPKG on cp2017 is CRITICAL: Timeout while attempting connection Yuvi Panda https://phabricator.wikimedia.org/T114022 - The acknowledgement expires at: 2015-09-30 20:28:02. [20:28:21] ACKNOWLEDGEMENT - Disk space on cp2017 is CRITICAL: Timeout while attempting connection Yuvi Panda https://phabricator.wikimedia.org/T114022 - The acknowledgement expires at: 2015-09-30 20:28:02. [20:28:21] ACKNOWLEDGEMENT - Freshness of OCSP Stapling files on cp2017 is CRITICAL: Timeout while attempting connection Yuvi Panda https://phabricator.wikimedia.org/T114022 - The acknowledgement expires at: 2015-09-30 20:28:02. [20:28:21] ACKNOWLEDGEMENT - HTTPS on cp2017 is CRITICAL: Return code of 255 is out of bounds Yuvi Panda https://phabricator.wikimedia.org/T114022 - The acknowledgement expires at: 2015-09-30 20:28:02. 
[20:28:22] ACKNOWLEDGEMENT - IPsec on cp2017 is CRITICAL: Timeout while attempting connection Yuvi Panda https://phabricator.wikimedia.org/T114022 - The acknowledgement expires at: 2015-09-30 20:28:02. [20:28:22] ACKNOWLEDGEMENT - NTP on cp2017 is CRITICAL: NTP CRITICAL: No response from NTP server Yuvi Panda https://phabricator.wikimedia.org/T114022 - The acknowledgement expires at: 2015-09-30 20:28:02. [20:28:23] ACKNOWLEDGEMENT - RAID on cp2017 is CRITICAL: Timeout while attempting connection Yuvi Panda https://phabricator.wikimedia.org/T114022 - The acknowledgement expires at: 2015-09-30 20:28:02. [20:28:23] ACKNOWLEDGEMENT - SSH on cp2017 is CRITICAL: Connection timed out Yuvi Panda https://phabricator.wikimedia.org/T114022 - The acknowledgement expires at: 2015-09-30 20:28:02. [20:28:24] ACKNOWLEDGEMENT - Varnish HTCP daemon on cp2017 is CRITICAL: Timeout while attempting connection Yuvi Panda https://phabricator.wikimedia.org/T114022 - The acknowledgement expires at: 2015-09-30 20:28:02. [20:28:28] simonft: done, pushed 0.8.2.1 and 0.8.2.2 [20:28:39] chasemp: cool! I need to update https://wikitech.wikimedia.org/wiki/Depooling_servers [20:28:50] (03CR) 10Gilles: "How does the code in the change you've linked to take care of not aborting during POSTs on clients' disconnection? In situation where the " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240634 (owner: 10Aaron Schulz) [20:30:27] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 for etonkovidova - https://phabricator.wikimedia.org/T113680#1682612 (10Krenair) I think the `researchers` group gives the requested abilities. 
(03PS4) 10Andrew Bogott: Rsync: Unquote booleans [puppet] - 10https://gerrit.wikimedia.org/r/241235 (https://phabricator.wikimedia.org/T113783) [20:30:50] yuvipanda: in general, things should self-depool if they're problematic, I'll take a look at it [20:31:07] (03CR) 10Aaron Schulz: "I forgot to mention https://gerrit.wikimedia.org/r/#/c/231191/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240634 (owner: 10Aaron Schulz) [20:31:50] (03PS2) 10Aaron Schulz: Removed ignore_user_abort( true ) line [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240634 [20:32:23] bblack: ok! [20:32:25] bblack: thanks [20:32:33] I also saw this just after I hit send on an email [20:32:36] oh wel [20:32:38] l [20:32:40] :) [20:32:51] bblack: I'm never fully sure when it's ok and when it isn't, so figured I might as well [20:33:42] RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING [20:34:22] in general, all the LVS/cp stuff is set up redundantly with self-monitoring and fail-over/out of various kinds. unless there's also something really messed up with those mechanisms, or something is in a strange half-failed state of some kind, one of those machines dying shouldn't cause any realtime-y persistent problems. [20:35:23] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Zhou Zhou - https://phabricator.wikimedia.org/T113325#1682632 (10ZhouZ) Hello! my public key has been uploaded to gerrit and my wikitech page. It is here again: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC7Gxb8FnbJRmu5chragFQRDJnLG3Fku7Tl7jkoKg... [20:35:34] 6operations: cp2017 is down - https://phabricator.wikimedia.org/T114022#1682638 (10BBlack) a:3BBlack [20:35:40] 6operations: cp2017 is down - https://phabricator.wikimedia.org/T114022#1682597 (10BBlack) p:5Triage>3Normal [20:35:57] bblack: post changes as noted were you ok w/ https://gerrit.wikimedia.org/r/#/c/240600/?
[20:36:27] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: move many classes out of worker.py into separate modules [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/241626 (owner: 10ArielGlenn) [20:36:43] (03CR) 10BBlack: [C: 031] LVS: git-ssh service for phab backend [puppet] - 10https://gerrit.wikimedia.org/r/240600 (owner: 10Rush) [20:37:15] bblack: thanks! [20:38:20] ottomata: thanks! Is 0.8.2.2 not ready yet? [20:38:40] bblack: so in a future cp*** hardware issue I can just file a bug and mark it down and that'd be ok? or is depooling a 'good thing' anyway? [20:38:41] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 60 ESP OK [20:38:42] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [20:38:52] RECOVERY - Host cp2017 is UP: PING OK - Packet loss = 0%, RTA = 34.94 ms [20:39:02] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 60 ESP OK [20:39:11] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 60 ESP OK [20:39:11] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 60 ESP OK [20:39:21] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [20:39:22] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 60 ESP OK [20:39:22] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [20:39:22] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 60 ESP OK [20:39:41] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 60 ESP OK [20:39:51] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 60 ESP OK [20:39:52] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 60 ESP OK [20:39:59] yuvipanda: yeah depooling isn't a bad idea, but depooling isn't well-documented now either heh, although it's getting better! :) [20:40:20] just saying, either way it shouldn't cause a real persistent outage, just some dropped connections that retry fine [20:40:22] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [20:40:24] bblack: ok! 
[20:40:30] bblack: yeah am trying to document them more :) [20:42:58] !log deployed parsoid version b9e5244e + hotfix on tin to turn off batching api use since canary restart of wtp1002 showed some batching api errors [20:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:50:23] 6operations: cp2017 is down - https://phabricator.wikimedia.org/T114022#1682734 (10BBlack) No logs, syslog/kern.log just go dark around the time of host down, with normal things going on just before like: ``` Sep 28 20:11:02 cp2017 systemd[1]: Reloading A high performance web server and a reverse proxy server. S... [20:51:01] !log MobileApps deployed sha1 9df72ec [20:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:52:17] 6operations: Booleans in hiera may be harmful - https://phabricator.wikimedia.org/T114018#1682751 (10Krenair) [20:54:12] 10Ops-Access-Requests, 6operations: RESTBase Admin access on aqs1001, aqs1002, and aqs1003 for Joseph and Dan - https://phabricator.wikimedia.org/T113416#1682760 (10Milimetric) @RobH: some more background might be necessary: There will be an additional RESTBase cluster called "Analytics Query Service" and det... [20:55:54] (03CR) 10Gilles: [C: 031] Removed ignore_user_abort( true ) line [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240634 (owner: 10Aaron Schulz) [21:07:45] PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:13:23] RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING [21:18:06] 6operations, 10ops-eqiad: analytics1049 /dev/sdi busted - https://phabricator.wikimedia.org/T114034#1682914 (10Ottomata) 3NEW [21:19:37] ottomata: gbp.conf is looking for tags in the form upstream/. Should those be in the repo as well? [21:20:02] looking [21:20:37] hmm, i think so simonft, doing so.. 
[21:21:09] just pushed upstream/0.8.2.2 [21:22:13] ottomata: thanks. That tag doesn't seem to be on either of the branches? [21:24:05] no, we haven't packaged 0.8.2.2 yet [21:24:17] 0.8.2.2 has two JIRAs fixed i think [21:24:22] one of which is critical for kafka brokers [21:24:23] using snappy [21:24:37] but, we already patched our version of 0.8.2.1 to include that fix [21:24:52] 0.8.2.1-3 release - fix for snappy 1.1.1.6 bug [21:25:23] https://github.com/wikimedia/operations-debs-kafka/commit/10bb882db14563648eb62c41528fb6b980aa04cf [21:25:47] ah, thanks for the clarification. If we end up packaging 0.8.2.2 I'll push the changes up. [21:29:42] I have a couple questions about RESTBase deployment and service-runner based applications. [21:30:03] We're getting ready to deploy the Analytics Query Service and I was going through the deployment documentation https://wikitech.wikimedia.org/wiki/RESTBase#Deployment_and_config_changes [21:30:09] 6operations, 10ops-eqiad: ms-be1012: slot=5 dev=sdf failed - https://phabricator.wikimedia.org/T113929#1682961 (10Cmjohnson) 64 days remain on the warranty. New disk request sent Congratulations: Work Order SR917796267 was successfully submitted. [21:30:46] can anyone help me think through how this works when we're deploying RESTBase but with a different config.yaml? [21:32:36] (03PS4) 10Thcipriani: Fix move_symlink relative paths [tools/scap] - 10https://gerrit.wikimedia.org/r/241576 [21:34:24] (03PS1) 10ArielGlenn: dumps: convert tabs to spaces for worker.py and related modules [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/241901 [21:34:46] PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:35:35] Hey yuvipanda, what project in labs are you using etcd in?
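The upstream/<version> tag convention discussed above comes from git-buildpackage; gbp.conf names the branches and tag patterns it expects. A typical stanza (a sketch, not necessarily the kafka packaging repo's actual file):

```ini
[DEFAULT]
upstream-branch = upstream
debian-branch = master
upstream-tag = upstream/%(version)s
debian-tag = debian/%(version)s
```

With this in place, gbp looks for a tag like upstream/0.8.2.2 to mark the imported upstream source for that version, which is why the missing tags broke the build.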
[21:35:38] i want to see how you set etcd_servers [21:35:51] ottomata: k8s-eval [21:36:28] RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING [21:36:39] ahhh, k thanks, found it. [21:36:39] hm [21:36:46] ottomata: but don't look at that, look at manifests/role/tools.pp [21:36:54] ottomata: that's the way I prefer specifying it :) [21:37:04] in role::toollabs::k8s::master [21:38:01] 6operations, 10ops-eqiad: analytics1049 /dev/sdi busted - https://phabricator.wikimedia.org/T114034#1682974 (10Cmjohnson) Requested a new disk from Dell. Congratulations: Work Order SR917796573 was successfully submitted. [21:38:18] oh ok yuvipanda. where is etcd_hosts set? [21:39:11] ottomata: right now nowhere, so it's only $::fqdn [21:39:18] ottomata: since I'm running it on only one host :P [21:39:51] (03PS2) 10ArielGlenn: dumps: convert tabs to spaces for worker.py and related modules [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/241901 [21:39:59] 6operations, 10ops-eqiad, 10netops: asw-d-eqiad SNMP failures - https://phabricator.wikimedia.org/T112781#1682977 (10Cmjohnson) I removed all the sfp's but it's possible some of those were the wrong type. Let's coordinate Tuesday 9/29 to add 2 and see if the error presents itself again. [21:40:30] ohhh default is fqdn [21:40:31] got it [21:40:33] ok [21:40:43] you'd expect that just to be an array of hostnames then? [21:40:56] 6operations, 10ops-eqiad, 10netops: asw-d-eqiad SNMP failures - https://phabricator.wikimedia.org/T112781#1682990 (10BBlack) >>! In T112781#1646330, @faidon wrote: > As to how to move forward: first off, let's make sure all of these servers are installed; the reboot loops are probably not helping. FTR, this... [21:41:04] yuvipanda: i would either make it an array of hostname:ports, or make it be a hash with more specific fields [21:41:05] ottomata: yes [21:41:07] things like cert, etc. [21:41:15] etcd_config maybe?
i dunno ,maybe not [21:41:17] ottomata: yes, like how you have for zookeper :D [21:41:33] ottomata: I've a similar string mashing thing for zookeeper as well [21:41:55] hm, sort of, zookeeper_hosts assigns zkids [21:42:03] yeah i have that for zk too :) [21:42:13] ottomata: look at manifests/role/labsmesos.pp [21:42:20] ottomata: I think we should standardize the ports [21:42:27] and it should just be an array of hostnames [21:42:35] ja i have [21:42:35] $zookeeper_hosts = keys(hiera('zookeeper_hosts')) [21:42:52] yuvipanda: i'm ok with that for now, buUuUuuut, yeah, ok. [21:42:54] :) [21:42:56] I have [21:42:58] $zookeeper_hosts = join(suffix(keys(hiera('zookeeper_hosts')), ':2181'), ',') [21:43:07] specifies the ports too :D [21:43:24] yeah :) [21:43:42] $zookeeper_url = inline_template("<%= @zookeeper_hosts.sort.join(',') %><%= @zookeeper_chroot %>") [21:43:43] :) [21:43:45] mUshing [21:43:57] PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:43:57] that too [21:44:01] nice [21:46:16] PROBLEM - puppet last run on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:46:42] yuvipanda: more qs for you. do you know much about the cert thing? [21:46:49] do I need a portion of etcd::ssl [21:46:50] ? [21:46:57] everything but the private keys? [21:47:03] I know unfortunately too much about it [21:47:05] for clients [21:47:07] yes [21:47:09] everything except private keys [21:47:15] see k8s::ssl. 
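The join(suffix(keys(...))) string mashing quoted above can be sketched in Python terms, assuming hiera's zookeeper_hosts is a hash keyed by hostname (the values carrying the zkids). The hostnames in the example are placeholders, not real production zookeepers:

```python
# Python sketch of the puppet expression
#   join(suffix(keys(hiera('zookeeper_hosts')), ':2181'), ',')
# plus the chroot suffix applied by the zookeeper_url inline_template.
def zookeeper_url(zookeeper_hosts, port=2181, chroot=""):
    """Build a ZooKeeper connection string from a hosts hash."""
    hosts = sorted(zookeeper_hosts)  # sort for a stable string, as the template does
    return ",".join("%s:%d" % (h, port) for h in hosts) + chroot

# Placeholder hostnames:
zk = {"zk1002.eqiad.wmnet": {"id": 2}, "zk1001.eqiad.wmnet": {"id": 1}}
print(zookeeper_url(zk, chroot="/kafka/main"))
# -> zk1001.eqiad.wmnet:2181,zk1002.eqiad.wmnet:2181/kafka/main
```

This is the same keys-then-suffix-then-join shape: only the hostnames come from hiera; the port and chroot are appended when the connection string is built.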
almost exact same code [21:47:21] we might make it not necessary at some time in the future [21:47:25] since the CA can be made public [21:47:27] but not yet [21:47:28] RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING [21:47:56] RECOVERY - puppet last run on analytics1035 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [21:48:20] ottomata: actually [21:48:32] Hm, yuvipanda maybe etcd::ssl should be split up into two classes then? [21:48:35] ottomata: is this for client or server? [21:48:37] client [21:49:04] ottomata: yes it should be. I think it's ok to just have a param (look at k8s::ssl) [21:49:11] hm [21:49:29] ok yeah that would be fine [21:49:40] will patch, see what j o e says... [21:49:46] kk [21:52:17] PROBLEM - Hadoop JournalNode on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:52:33] what the hehhhck is wrong with icinga and an35! [21:52:34] gah! [21:52:53] heh well its testing my new paging sms provider pretty well ;] [21:53:58] RECOVERY - Hadoop JournalNode on analytics1035 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode [21:54:29] meh? [21:54:31] robh: interestingly my messages are apparently coming from 'icinga@wikimedia.org' [21:54:37] might be time to ignore pages tonight [21:54:39] err [21:54:43] icinga@neon.wikimedia.org [21:57:17] night folks [21:58:48] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [500.0] [22:01:02] hm, yuvipanda do I have to copy the cert(s) over, or can I just use the one already on the host from puppet? [22:01:15] i see your flannel systemd does [22:01:20] --etcd-cafile=/var/lib/puppet/ssl/certs/ca.pem [22:01:37] ottomata: yes they're the same. 
we're using the puppet CA + certs for these [22:01:46] ottomata: I did that with flannel because flannel is already running as root [22:02:01] ahh right, this won't be able to read that [22:02:32] ok [22:02:35] ottomata: yup hence making puppet copy things over [22:05:47] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [22:07:33] (03CR) 10Faidon Liambotis: [C: 04-1] move misc/labsdebrepo out of misc to module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn) [22:07:40] yuvipanda: ? [22:07:46] yuvipanda: your pages are? [22:08:00] Well, you need to keep in mind that your pages are via your cellular providers email to sms gateway [22:08:00] robh: yes, I'm not sure how that is. I don't see a number even [22:08:09] that's doubly interesting [22:08:12] oh wait [22:08:31] hmm since I didn't know you can have sms come from things that aren't numbers [22:08:38] So its not that difficult if you parse out how it sends the sms notifications [22:08:43] we generate the sms notifications out to email [22:08:58] so we send to your_cell@provider.email.sms.gateway [22:09:04] if you are a US based cell phone # [22:09:19] so then its up to your cell provider and how they pass origin id [22:09:30] seems like they just use the email address for... verizon? 
[22:09:35] i dont have the file open anymore ;] [22:10:32] ah ok [22:13:22] (03PS2) 10Ottomata: Puppetize etcd use for eventlogging processor [puppet] - 10https://gerrit.wikimedia.org/r/240916 (https://phabricator.wikimedia.org/T112688) [22:13:46] yuvipanda: would appreciate a review if you have some time [22:13:47] https://gerrit.wikimedia.org/r/#/c/240916/ [22:14:23] (03PS3) 10Ottomata: Puppetize etcd use for eventlogging processor [puppet] - 10https://gerrit.wikimedia.org/r/240916 (https://phabricator.wikimedia.org/T112688) [22:15:07] (03PS4) 10Ottomata: Puppetize etcd use for eventlogging processor [puppet] - 10https://gerrit.wikimedia.org/r/240916 (https://phabricator.wikimedia.org/T112688) [22:17:01] (03CR) 10Dduvall: [C: 032] Fix move_symlink relative paths [tools/scap] - 10https://gerrit.wikimedia.org/r/241576 (owner: 10Thcipriani) [22:18:05] ottomata: not sure what you mean by 'unfortunately it is hardcoded here'? [22:18:14] it is in a separate hiera file, I don't see how that is 'hardcoded' [22:19:03] ottomata: also is passing the cert path in the URI a standard thing? [22:21:13] naw, its an eventlogging thing [22:21:31] configs via uri [22:22:02] https://gerrit.wikimedia.org/r/#/c/238854/11/server/eventlogging/utils.py [22:22:09] ottomata: should probably mention that :D [22:22:11] so it isn't confusing etc [22:22:36] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:22:43] well, its in the eventlogging role [22:22:51] but ok! [22:23:06] nbd [22:23:30] yuvipanda: i'll mention it in the processor define for param doc [22:23:38] ok [22:25:10] (03PS5) 10Ottomata: Puppetize etcd use for eventlogging processor [puppet] - 10https://gerrit.wikimedia.org/r/240916 (https://phabricator.wikimedia.org/T112688) [22:28:54] RoanKattouw: ostriches: Krenair: greg-g: I'm going to add a request for a slot on the evening SWAT deploy, hope that's OK.
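The "configs via uri" convention discussed above (see the eventlogging utils.py link in the log) can be sketched roughly as follows. This is a simplified stand-in, not the actual eventlogging code: query parameters on the endpoint URI — including a cert path — are parsed out and become client options.

```python
from urllib.parse import urlparse, parse_qs

def uri_to_kwargs(uri):
    """Split an eventlogging-style URI into an endpoint plus keyword args.

    Simplified sketch of the "configs via uri" pattern: anything in the
    query string rides along as a client option, so a cert path can be
    passed in the URI itself rather than as separate configuration.
    """
    parts = urlparse(uri)
    endpoint = "%s://%s%s" % (parts.scheme, parts.netloc, parts.path)
    # parse_qs returns lists; flatten single-valued parameters.
    kwargs = {k: v[0] if len(v) == 1 else v
              for k, v in parse_qs(parts.query).items()}
    return endpoint, kwargs

endpoint, kwargs = uri_to_kwargs(
    "https://etcd.example.org:2379/v2/keys?ca_cert=/etc/ssl/ca.pem&timeout=5"
)
```

(The hostname and parameter names here are illustrative, not eventlogging's real option names.)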
Just waiting for Jenkins to merge a thing [22:30:05] RoanKattouw: ostriches: Krenair: greg-g: This is the tip of my cherry-pickings https://gerrit.wikimedia.org/r/#/c/241926/ , just one more patch that I'm waiting for Jenkins to merge before cherry-picking onto that [22:30:33] Sure [22:30:43] It's not for another half hour, you're not even close to being late [22:30:47] "on time" is 4:58pm :P [22:30:49] *3:58 [22:31:10] (03Merged) 10jenkins-bot: Fix move_symlink relative paths [tools/scap] - 10https://gerrit.wikimedia.org/r/241576 (owner: 10Thcipriani) [22:33:37] csteipp: yt? [22:33:52] RoanKattouw_away: cool! thx :) [22:33:58] (03PS5) 10Rush: LVS: git-ssh service for phab backend [puppet] - 10https://gerrit.wikimedia.org/r/240600 [22:33:59] nuria: Yeah, what's up? [22:34:26] csteipp: at your convenience please take a look at: [22:34:54] csteipp: https://wikitech.wikimedia.org/wiki/Analytics/Data/PreventingIdentityReconstruction [22:35:06] csteipp: and let us know if anything else needs to be done [22:35:38] (03CR) 10Rush: [C: 032] LVS: git-ssh service for phab backend [puppet] - 10https://gerrit.wikimedia.org/r/240600 (owner: 10Rush) [22:35:48] 7Blocked-on-Operations, 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1683319 (10chasemp) [22:37:05] nuria: Yeah, I saw that this morning. I wasn't sure on #Next_Steps if you're thinking you'll publish once user_agent_map is removed? Or will you do the exploration with geo before publishing? I think you should :) [22:38:22] csteipp: we will only publish some dimentions if we ever do it (cc milimetric ) but we have no immediate plans to do so [22:38:28] *dimensions [22:38:46] csteipp: it is likely we will need all geo and user agent info removed [22:41:18] nuria: Cool. Yeah, that's what I was worried about, and it appears that at least a few users would be exposed. So I'm glad we're reevaluating.
[22:42:40] So not knowing what pressure your team is under to make that api happen, just let me know when you reevaluate, and I'm happy to re-look at whatever you come up with. [22:43:01] csteipp: the api does not surface all dimentions though [22:43:28] csteipp: cause a known exploit (even before this work) will be to geo-locate editors for pages with a few pageviews [22:43:46] csteipp: as edit data is already public [22:44:06] csteipp: so api surfaces partial dimentions [22:44:15] *dimensions, argh cc milimetric [22:49:41] csteipp: nuria's email wasn't related to the pageview API, just to be clear [22:51:11] I think that investigation is useful more for the question of "how long do we keep around pageview_hourly" data? And the conclusion is: it makes it easy for a hacker to reconstruct some data we wouldn't want them to have. So we should consider purging at least user-agent-map after X days. [22:52:09] but the data in pageview_hourly will never be made public in that form, and will only be public in a form that's not susceptible to any de-anonymization attack (which is like really hard to guarantee) [22:54:10] Ah, that's right. Sorry, so many projects at once :) [22:59:23] AndyRussG: are you going to pull the trigger on the cookie patch? [23:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150928T2300). [23:00:04] RoanKattouw matt_flaschen AndyRussG: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:13] ori: sooooooon... I'll look at again tonight or tomorrow morning :) [23:00:32] I'll do it [23:00:37] Present [23:00:44] AndyRussG: seriously, that's a bit rude. It fixes an issue in your repository and you've stalled it for no good reason that I can see. [23:01:29] ori: no it's not. 
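The reconstruction-risk discussion above — geo-locating editors of pages with only a few pageviews by joining against public edit data — boils down to a threshold-suppression rule: only release an aggregate row when enough views back it. A minimal sketch of that idea; the threshold `k=10` and the row shape are illustrative, not Wikimedia policy:

```python
def suppress_small_groups(rows, k=10):
    """Drop aggregate rows whose view count falls below k.

    Rows pairing a page with a geography are only safe to release when
    enough views back them: a (page, country) row with very few views can
    be cross-referenced with the public edit history to identify and
    locate individual editors, which is the exploit described above.
    """
    return [r for r in rows if r["views"] >= k]

hourly = [
    {"page": "Main_Page", "country": "US", "views": 51234},
    {"page": "Obscure_Article", "country": "IS", "views": 3},  # risky row
]
safe = suppress_small_groups(hourly)
```

Note that simple suppression like this is only a first line of defense; as milimetric says above, guaranteeing a published form is not susceptible to *any* de-anonymization attack is much harder.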
I'm sorry you think that, but I'm happy to talk about it at another time [23:01:46] AndyRussG: What am I supposed to do in re the CN patch you put in SWAT? [23:01:55] Cause it's already merged into wmf_deploy, so what do I do now [23:02:01] RoanKattouw: yeah I'm preparing the core patch now [23:02:07] Oh OK [23:02:18] RoanKattouw: that was just put there so you could see what was going in :) [23:03:12] I guess you have to do manual submodule updates for that branch? [23:03:33] On the wmf/1.xwmfy branches we get automatic submodule updates :) [23:04:29] for most extensions anyway [23:05:49] RoanKattouw: Krenair: yeah CN is manual... [23:06:04] PROBLEM - Confd template for /etc/pybal/pools/git-ssh on lvs1009 is CRITICAL: File not found: /etc/pybal/pools/git-ssh [23:06:55] PROBLEM - Confd template for /etc/pybal/pools/git-ssh on lvs1006 is CRITICAL: File not found: /etc/pybal/pools/git-ssh [23:07:34] PROBLEM - Confd template for /etc/pybal/pools/git-ssh on lvs1003 is CRITICAL: File not found: /etc/pybal/pools/git-ssh [23:07:39] RoanKattouw: Krenair: hmm those nice instructions that used to be there about updating the submodule branch are gone... [23:07:57] because it's no longer necessary to do that for normal extensions [23:08:13] Yeah I got rid of them [23:09:44] It does auto-update [23:09:52] I see a whole bunch of CentralNotice commits in wmf24 [23:10:12] ragesoss: Ahh K so I guess it's automatic now for CN? [23:10:26] RoanKattouw: ^ (sorry ragesoss!) [23:10:49] chasemp: the stuff above about the missing git-ssh pool file for pybal, that's another part of this we didn't puppetize yet I guess [23:11:24] RoanKattouw, isn't CentralNotice supposed to be tracking the wmf_deploy branch instead?
[23:11:29] It does [23:11:45] See c255a4857fb7c5919170c0aaac0d21a9f9c98fa6 in core's wmf24 branch [23:11:59] Heh that would explain why git diff was coming up empty ;) [23:12:00] It auto-updates the submodule pointer for Ia28c3503b17bf18dc7fb18ba0cf9bd61d2703f6b [23:12:46] which Gerrit says is a commit in wmf_deploy [23:12:53] !log catrope@tin Synchronized php-1.26wmf24/extensions/CentralNotice: SWAT (duration: 00m 19s) [23:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:13:12] !log catrope@tin Synchronized php-1.26wmf24/extensions/Flow: SWAT (duration: 00m 19s) [23:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:13:31] !log catrope@tin Synchronized php-1.26wmf24/extensions/Echo: SWAT (duration: 00m 18s) [23:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:14:05] Krenair: yaaay good luck! :DD [23:14:22] RoanKattouw: Krenair: yeah looks great! [23:14:26] wrong channel legoktm :P [23:15:51] (03PS1) 10Ottomata: Deploy eventlogging to terbium [puppet] - 10https://gerrit.wikimedia.org/r/241969 (https://phabricator.wikimedia.org/T112660) [23:15:59] OK all done [23:16:16] RoanKattouw: all done, as in, all done? @_@ [23:17:26] 6operations: audit all SSL certificates expiry on ops tracking gcal - https://phabricator.wikimedia.org/T112542#1683492 (10RobH) google sheet has been updated with the most recent mail purchases and all info has been added to the google calendar for expiry tracking. each entry has a 4 week notification email se... [23:17:45] 6operations: audit all SSL certificates expiry on ops tracking gcal - https://phabricator.wikimedia.org/T112542#1683493 (10RobH) 5Open>3Resolved Actually, we'll just create a new task and link. 
[23:18:12] weeeeee [23:18:48] (03CR) 10Ottomata: [C: 032] Deploy eventlogging to terbium [puppet] - 10https://gerrit.wikimedia.org/r/241969 (https://phabricator.wikimedia.org/T112660) (owner: 10Ottomata) [23:20:05] PROBLEM - Hadoop NodeManager on analytics1049 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [23:20:15] (03CR) 10Ori.livneh: "Yes, it wouldn't work for anonymous access. But I very much doubt that the interface could be sufficiently sanitized from potentially priv" [puppet] - 10https://gerrit.wikimedia.org/r/241578 (owner: 10Gergő Tisza) [23:22:07] 6operations: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1683510 (10RobH) 3NEW [23:24:50] RoanKattouw: so far so good, looks like everything's in order, thanks so much!!!!! [23:29:35] 6operations, 6Labs, 10Labs-Infrastructure: install/setup labservices1001 - https://phabricator.wikimedia.org/T106584#1683533 (10RobH) [23:30:11] (03PS1) 10Mforns: [WIP] Consume EventLogging validation logs from Logstash [puppet] - 10https://gerrit.wikimedia.org/r/241984 (https://phabricator.wikimedia.org/T113627) [23:30:30] (03PS2) 10Mforns: [WIP] Consume EventLogging validation logs from Logstash [puppet] - 10https://gerrit.wikimedia.org/r/241984 (https://phabricator.wikimedia.org/T113627) [23:30:50] (03CR) 10Mforns: [C: 04-1] "Still WIP" [puppet] - 10https://gerrit.wikimedia.org/r/241984 (https://phabricator.wikimedia.org/T113627) (owner: 10Mforns) [23:35:42] (03PS3) 10Mforns: [WIP] Consume EventLogging validation logs from Logstash [puppet] - 10https://gerrit.wikimedia.org/r/241984 (https://phabricator.wikimedia.org/T113627) [23:37:27] (03PS1) 10RobH: labservices1001 install params [puppet] - 10https://gerrit.wikimedia.org/r/241996 [23:38:16] (03CR) 10Mforns: [C: 04-1] "Still WIP." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/241984 (https://phabricator.wikimedia.org/T113627) (owner: 10Mforns) [23:38:42] (03PS1) 10Yuvipanda: ldap: Remove add-labs-user [puppet] - 10https://gerrit.wikimedia.org/r/241998 [23:38:51] (03PS2) 10Yuvipanda: ldap: Remove add-labs-user [puppet] - 10https://gerrit.wikimedia.org/r/241998 [23:39:09] (03CR) 10RobH: [C: 032] labservices1001 install params [puppet] - 10https://gerrit.wikimedia.org/r/241996 (owner: 10RobH) [23:41:01] legoktm, so I guess we're doing foreachwiki runBatchedQuery.php "delete from user_properties where up_property = 'searchNs-1' limit 500;" or something? [23:42:11] (03PS3) 10Ori.livneh: Removed ignore_user_abort( true ) line [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240634 (owner: 10Aaron Schulz) [23:42:14] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [23:42:19] (03CR) 10Ori.livneh: [C: 032] Removed ignore_user_abort( true ) line [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240634 (owner: 10Aaron Schulz) [23:42:27] (03Merged) 10jenkins-bot: Removed ignore_user_abort( true ) line [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240634 (owner: 10Aaron Schulz) [23:43:25] (03CR) 10EBernhardson: [C: 032] Fix TTMServer config to use the extra plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241019 (https://phabricator.wikimedia.org/T113711) (owner: 10DCausse) [23:43:45] !log ori@tin Synchronized wmf-config/CommonSettings.php: I56db35b: Removed ignore_user_abort( true ) line (duration: 00m 18s) [23:43:47] AaronSchulz: ^ [23:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:43:55] \o/ [23:45:45] (03PS3) 10Yuvipanda: ldap: Remove add-labs-user & scriptconfig.py [puppet] - 10https://gerrit.wikimedia.org/r/241998 [23:46:09] greg-g: Is there a process (I'm assuming no) for making mass changes to the user properties table (two use cases – to clean it up and kill a few 
million nonsense rows, and to mass-set a preference in advance of a config change)? [23:46:17] ( /cc Krenair.) [23:46:50] (03PS4) 10Yuvipanda: ldap: Remove add-labs-user & scriptconfig.py [puppet] - 10https://gerrit.wikimedia.org/r/241998 [23:47:01] (03CR) 10Yuvipanda: [C: 032 V: 032] ldap: Remove add-labs-user & scriptconfig.py [puppet] - 10https://gerrit.wikimedia.org/r/241998 (owner: 10Yuvipanda) [23:47:12] !log ebernhardson@tin Synchronized wmf-config/CommonSettings.php: Update ttmserver configuration to match elasticsearch security profile (duration: 00m 17s) [23:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:51:29] James_F, you wouldn't call "ineligible for SWAT" a process, but... :p [23:52:06] oh, I guess we also need to clear up our enable pref after the default switch is thrown [23:52:20] Krenair: Sure, but does that mean "Krenair can just do it when it's quiet" or "OMG 2000 years of committee review and must be carried out by 20 people OKing it at once within 20 seconds". [23:52:22] Yeah. [23:53:09] James_F: none that I know of :) [23:53:35] (03PS1) 10Yuvipanda: ldap: mwclient no longer needed for ldap client [puppet] - 10https://gerrit.wikimedia.org/r/242012 [23:53:37] (03PS1) 10Yuvipanda: ldap: Clean out some ensure => absented files [puppet] - 10https://gerrit.wikimedia.org/r/242013 [23:53:44] greg-g: OK, in that case let's take it carefully and do it in small batches, if that makes sense Krenair? [23:53:50] yeah I think so [23:54:06] s/batches/batches of batches/ [23:57:39] batches of patches
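The cleanup being discussed above — `foreachwiki runBatchedQuery.php "delete from user_properties where up_property = 'searchNs-1' limit 500;"` — deletes in small batches so replication can keep up between rounds. A runnable sketch of the same loop; the real tool is MediaWiki's runBatchedQuery.php maintenance script (which also waits on replica lag), and sqlite3 is used here only to make the loop executable:

```python
import sqlite3

def delete_in_batches(conn, prop, batch_size=500):
    """Delete matching user_properties rows one small batch at a time.

    Selecting rowids and then deleting by id avoids DELETE ... LIMIT
    (not available in every engine) and keeps each write transaction
    small, which is the point of batching on a replicated master.
    """
    deleted = 0
    while True:
        cur = conn.execute(
            "SELECT rowid FROM user_properties WHERE up_property = ? LIMIT ?",
            (prop, batch_size),
        )
        ids = [row[0] for row in cur.fetchall()]
        if not ids:
            return deleted
        conn.execute(
            "DELETE FROM user_properties WHERE rowid IN (%s)"
            % ",".join("?" * len(ids)),
            ids,
        )
        conn.commit()
        deleted += len(ids)
        # A production run would wait for replication lag here
        # between batches, as runBatchedQuery.php does.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_properties (up_user INT, up_property TEXT)")
conn.executemany(
    "INSERT INTO user_properties VALUES (?, ?)",
    [(i, "searchNs-1") for i in range(1200)] + [(0, "gadget-foo")],
)
removed = delete_in_batches(conn, "searchNs-1")
```

With 1200 matching rows and a batch size of 500, this runs three delete rounds (500, 500, 200) plus one empty round that ends the loop, leaving unrelated preferences untouched.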