[00:00:02] bblack would know of course [00:00:39] it's correct, except there's an effort to eliminate the lvs from between varnish and apache and rely on varnish's own load-balancing capabilities [00:02:44] and you probably use memcache on the backend to avoid too many database reads? [00:02:58] yep [00:03:05] Toordog: we have a lot of layers of caching [00:03:23] memcached, then there is 'parser cache', we also use redis for sessions (and job queues) [00:04:06] I was wondering what the use of redis was [00:04:23] parser cache sits in mysql doesn't it? [00:04:24] Toordog: we use it for session storage because it can replicate cross DC [00:04:30] ok [00:04:31] Krenair: it's a separate mysql cluster, yeah [00:04:34] yeah [00:04:47] redis also gets used for sending the newer (non-irc) rc feeds [00:05:39] what is squid for? on the architecture drawing, it shows that varnish is only used for mobile and upload, while squid would be used for normal content? [00:05:42] and something to do with file backends, and something to do with the profiler. according to grep on the mediawiki config [00:05:55] squid was replaced with varnish, mostly [00:06:09] ah ok, outdated drawing then :) [00:06:14] that makes more sense too [00:08:09] There's still a bunch of things that refer to varnish as 'squid' [00:08:39] yeah [00:08:44] the schema makes it hard to see which cluster runs what, and what is the front end and the backend *other than obvious cases like a database*. [00:08:53] what diagram is this? [00:09:01] the one on the link you sent me about LVS [00:09:25] ah [00:09:37] that's just the English Wikipedia article on LVS [00:09:43] nothing to do with wikimedia's architecture [00:10:00] aww, I thought it was the wikimedia one [00:10:01] which shows wikimedia's network as an example, see the diagram [00:10:01] hahaha [00:10:03] oh lol [00:10:04] it does [00:10:12] yeah that's very outdated [00:10:33] you have a more recent schema? [00:10:48] nope [00:11:02] someone should do one at some point....
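The layered read path described above (Varnish in front, then memcached, then the parser cache in its own MySQL cluster, with Redis handling sessions and job queues separately) can be summarised with a small sketch. This is an illustrative model only, not MediaWiki's actual code: the class and function names are hypothetical, and in-memory dicts stand in for the real memcached and MySQL services.

```python
"""Illustrative sketch of the layered caching described in the chat above.

A page render is looked up in memcached first, then in the parser cache
(which the conversation notes lives in a separate MySQL cluster), and only
re-rendered from the database on a full miss. All names are hypothetical.
"""

class DictStore:
    """In-memory stand-in for memcached / the parser-cache MySQL cluster."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def render(wikitext):
    # Stand-in for the (expensive) parse/render step.
    return f"<html>{wikitext}</html>"


class LayeredPageCache:
    def __init__(self, memcached, parser_cache, wikitext_db):
        self.memcached = memcached        # cheapest, volatile layer
        self.parser_cache = parser_cache  # rendered HTML, persisted
        self.wikitext_db = wikitext_db    # canonical article text

    def get_rendered_page(self, title):
        # 1. Cheapest layer: memcached.
        html = self.memcached.get(title)
        if html is not None:
            return html
        # 2. Parser cache: previously rendered HTML.
        html = self.parser_cache.get(title)
        if html is None:
            # 3. Full miss: render from the stored wikitext.
            html = render(self.wikitext_db[title])
            self.parser_cache.set(title, html)
        # Repopulate memcached so the next request stops at step 1.
        self.memcached.set(title, html)
        return html


if __name__ == "__main__":
    cache = LayeredPageCache(DictStore(), DictStore(), {"Main_Page": "Hello"})
    print(cache.get_rendered_page("Main_Page"))  # rendered once
    print(cache.get_rendered_page("Main_Page"))  # served from the memcached layer
```

The point of the ordering is that each layer is cheaper than the one behind it, so the database only sees requests that have missed every cache.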
>_> [00:11:16] if I get hired I'll do it ;) [00:11:26] heh [00:12:02] YuviPanda, I suspect that has been suggested multiple times over the years, someone has drawn a diagram, and then it's left to rot until someone else starts again from scratch :/ [00:12:28] yup [00:12:34] needs an extensible diagram [00:13:09] this one is an SVG by Ryan Lane, two years old [00:16:16] Oh, look, second entry on wikitech's Special:ListFiles: [00:16:17] https://wikitech.wikimedia.org/wiki/File:Wikimedia-cluster.svg [00:18:08] nice, you are using logstash :) [00:25:01] nice [00:25:05] that's pretty accurate [00:41:58] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.119601328904 [00:52:29] 6operations, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1634588 (10Tgr) 3NEW [00:52:55] 6operations, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1634598 (10Tgr) [01:06:08] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.00333333333333 [01:16:20] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [01:18:09] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 1.56 ms [02:34:18] !log l10nupdate@tin Synchronized php-1.26wmf22/cache/l10n: l10nupdate for 1.26wmf22 (duration: 10m 13s) [02:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:35:39] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 21 connecting: (unnamed) not-conn: cp2015_v6, cp3017_v6, cp4011_v6 [02:37:39] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 24 ESP OK [02:40:44] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf22) at 2015-09-13 02:40:43+00:00 [02:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:05:49] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied [03:23:03] (03CR) 10Tim Starling: "Yes, it was Jimmy's idea in the first place, but Jimmy is not blocking alteration of that setting. 
According to Greg Maxwell he was in fav" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237960 (owner: 10Kaldari) [03:25:48] RECOVERY - Disk space on labstore1002 is OK: DISK OK [03:35:49] PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Puppet has 1 failures [03:36:09] PROBLEM - puppet last run on mw1183 is CRITICAL: CRITICAL: Puppet has 1 failures [03:36:48] PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: Puppet has 1 failures [03:49:09] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [04:02:01] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:02:19] RECOVERY - puppet last run on mw1183 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [04:02:58] RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:06:09] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-maps/snapshot is not accessible: Permission denied [04:10:09] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2628.38 Read Requests/Sec=2375.59 Write Requests/Sec=48.31 KBytes Read/Sec=9518.28 KBytes_Written/Sec=193.26 [04:11:38] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [04:16:09] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=2.41 Read Requests/Sec=0.00 Write Requests/Sec=0.30 KBytes Read/Sec=0.00 KBytes_Written/Sec=1.20 [04:35:49] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [04:45:49] PROBLEM - puppet last run on mw2031 is CRITICAL: CRITICAL: puppet fail [04:45:58] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [04:51:58] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 4 below the confidence bounds [04:55:59] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [05:13:59] RECOVERY - puppet last run on mw2031 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [05:38:28] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [06:02:52] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Sep 13 06:02:52 UTC 2015 (duration 2m 51s) [06:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:15:28] (03PS33) 10Gergő Tisza: Basic role for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [06:25:56] 6operations, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, 6Services, 5ContentTranslation-Release6: Apertium Failed to load resource: net::ERR_SPDY_PROTOCOL_ERROR - https://phabricator.wikimedia.org/T112403#1634695 (10Amire80) p:5Triage>3Normal [06:25:59] PROBLEM - Disk space on elastic1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=95%) [06:27:59] RECOVERY - Disk space on elastic1001 is OK: DISK OK [06:31:00] PROBLEM - 
puppet last run on mc2005 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:28] PROBLEM - puppet last run on mw2043 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:58] PROBLEM - puppet last run on mw1220 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:20] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:20] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:28] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:29] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:39] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:49] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:38] 7Blocked-on-Operations, 6operations, 6Commons, 6Multimedia, and 6 others: Convert eqiad imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1634703 (10Bawolff) Just for cross-reference, a regression was found with rsvg - svgs > 10 MB (there's about 9000 such files) don't render. We'd ne... [06:55:48] RECOVERY - puppet last run on mw2043 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:56:09] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:56:39] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:39] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:40] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:56:48] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:56:59] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:57:10] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:57:19] RECOVERY - puppet last run on mc2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:29] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [07:00:59] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: puppet fail [07:09:39] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [07:27:09] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [07:56:00] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [08:32:30] (03CR) 10Ricordisamoa: [C: 031] Configure $wgExtraSignatureNamespaces for it.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237331 (https://phabricator.wikimedia.org/T7645) (owner: 10Nemo bis) [10:18:02] 6operations, 10netops: Wikimedia sites not reachable through CenturyLink ISP - https://phabricator.wikimedia.org/T112396#1635062 (10faidon) [10:19:47] 6operations, 10netops: Wikimedia sites not reachable through CenturyLink ISP - https://phabricator.wikimedia.org/T112396#1634165 (10faidon) OK. 
It's impossible to find out what happened without more data unfortunately :( I'm going to resolve this for now but feel free to reopen (or open a new task) if somethi... [10:21:48] 6operations, 10netops: Wikimedia sites not reachable through CenturyLink ISP - https://phabricator.wikimedia.org/T112396#1635069 (10faidon) 5Open>3Resolved a:3faidon [12:20:31] 6operations, 7Pybal: jessie pybals get restarted every day by logrotate, resetting BGP sessions - https://phabricator.wikimedia.org/T112457#1635133 (10faidon) 3NEW [12:26:47] 6operations, 7Pybal: jessie pybals get restarted every day by logrotate, resetting BGP sessions - https://phabricator.wikimedia.org/T112457#1635140 (10faidon) The easy fix here is to use `reload` rather than `force-reload` from logrotate. It's handled identically by the init script but isn't being interpreted... [12:29:58] (03PS1) 10Faidon Liambotis: Use "reload" instead of "force-reload" from logrotate [debs/pybal] - 10https://gerrit.wikimedia.org/r/237986 (https://phabricator.wikimedia.org/T112457) [13:19:19] (03CR) 10BBlack: [C: 031] Use "reload" instead of "force-reload" from logrotate [debs/pybal] - 10https://gerrit.wikimedia.org/r/237986 (https://phabricator.wikimedia.org/T112457) (owner: 10Faidon Liambotis) [14:21:18] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 [14:21:35] Stuff broken? [14:22:00] "Request: GET http://en.wikipedia.org/wiki/User_talk:Orion_2012, from 10.20.0.105 via cp1052 cp1052 ([10.64.32.104]:3128), Varnish XID 78792582 [14:22:01] Forwarded for: 80.176.129.180, 10.20.0.176, 10.20.0.176, 10.20.0.105 [14:22:02] Error: 503, Service Unavailable at Sun, 13 Sep 2015 14:21:35 GMT " [14:22:10] Planned outage guys? [14:22:10] errore 503 being served by esams [14:22:18] *error [14:22:33] Who had calamari for lunch ;) XD [14:22:48] Better? [14:22:50] [16:21:18] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 <-- sounds serious [14:23:00] Working here again. [14:23:16] same here [14:23:55] 6operations: 503's on multiple sites - https://phabricator.wikimedia.org/T112463#1635273 (10Multichill) 3NEW [14:26:03] 6operations: 503's on multiple sites - https://phabricator.wikimedia.org/T112463#1635281 (10Vituzzu) The same at it.wiki. [14:27:54] Error: 503, Service Unavailable at Sun, 13 Sep 2015 14:27:28 GMT [14:27:58] PROBLEM - HHVM busy threads on mw1247 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [14:28:09] hmpf [14:28:15] Up but intermittent here - (UK [14:28:40] dead here too [14:28:41] Someone make a big change? [14:28:50] PROBLEM - HHVM busy threads on mw1170 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [115.2] [14:29:10] I've noted this sort of things sometimes happens when a template change propogates across a number of pages [14:29:30] 6operations: Intermittent 503's on multiple sites - https://phabricator.wikimedia.org/T112463#1635285 (10Multichill) p:5Triage>3Unbreak! 
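On the pybal/logrotate change discussed earlier above (T112457, switching logrotate from `force-reload` to `reload`): the sketch below illustrates the general LSB convention that makes the two verbs differ. `reload` re-reads configuration in place, while `force-reload` is allowed to fall back to a full restart, which for a BGP-speaking daemon means dropping its sessions. This is a generic, hypothetical dispatcher, not pybal's actual init script.

```python
"""Generic sketch of the LSB `reload` vs `force-reload` semantics, to go with
T112457 above. This is NOT pybal's actual init script; the function and the
behaviour strings are hypothetical and only illustrate why having logrotate
ask for `force-reload` can be disruptive for a BGP-speaking daemon."""

def dispatch(command, supports_reload=True):
    if command == "reload":
        # Re-read configuration in place; long-lived sessions stay up.
        return "reloaded config in place, BGP sessions stay up"
    if command == "force-reload":
        # LSB: "reload if the service supports it, otherwise restart".
        # Anything that maps this verb to a restart resets the daemon's BGP
        # sessions, hence the switch to plain `reload` in the logrotate hook.
        if supports_reload:
            return "reloaded config in place, BGP sessions stay up"
        return "restarted daemon, BGP sessions reset"
    raise ValueError(f"unsupported command: {command}")

if __name__ == "__main__":
    print(dispatch("reload"))
    print(dispatch("force-reload", supports_reload=False))
```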
[14:29:49] PROBLEM - HHVM rendering on mw1241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:29:49] PROBLEM - Apache HTTP on mw1251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:29:50] PROBLEM - HHVM busy threads on mw1043 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [14:29:50] PROBLEM - Apache HTTP on mw1252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:29:53] ShakespeareFan00: not sure it would cause a complete blackout [14:29:59] Hi guys, https://phabricator.wikimedia.org/T112463 [14:30:13] 6operations: Intermittent 503's on multiple sites - https://phabricator.wikimedia.org/T112463#1635290 (10MarcoAurelio) Same at eswiki: Request: GET `[removed]`, from 10.20.0.176 via cp1066 cp1066 ([10.64.0.103]:3128), Varnish XID 2216592190 Forwarded for: `***.***.***.***`, 10.20.0.109, 10.20.0.109, 10.20.0.176... [14:30:29] PROBLEM - HHVM queue size on mw1068 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [14:30:29] Intermittent 503's on multiple sites (Wikipedia, Commons, Wikidata, and probably more) [14:30:39] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [500.0] [14:30:49] PROBLEM - HHVM queue size on mw1247 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [80.0] [14:31:09] PROBLEM - HHVM queue size on mw1253 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [80.0] [14:31:48] RECOVERY - HHVM rendering on mw1241 is OK: HTTP OK: HTTP/1.1 200 OK - 65419 bytes in 0.082 second response time [14:31:48] RECOVERY - Apache HTTP on mw1251 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.037 second response time [14:31:49] RECOVERY - Apache HTTP on mw1252 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.034 second response time [14:31:59] RECOVERY - HHVM busy threads on mw1043 is OK: OK: Less than 30.00% above the threshold [57.6] [14:32:18] RECOVERY - HHVM busy threads on mw1247 is OK: OK: Less than 30.00% above the threshold [76.8] [14:32:18] hey [14:32:19] PROBLEM - HHVM busy threads on mw1071 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [14:32:29] PROBLEM - HHVM busy threads on mw1237 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [14:32:29] PROBLEM - HHVM queue size on mw1071 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [14:32:29] RECOVERY - HHVM queue size on mw1068 is OK: OK: Less than 30.00% above the threshold [10.0] [14:32:29] PROBLEM - HHVM busy threads on mw1176 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [14:32:39] 6operations: Intermittent 503's on multiple sites - https://phabricator.wikimedia.org/T112463#1635293 (10Lluis_tgn) Same for ca.wiki, en.wiki, phabricator... Now is OK. ``` Request: GET http://ca.wikipedia.org/wiki/Shakira, from 10.20.0.112 via cp1068 cp1068 ([10.64.0.105]:3128), Varnish XID 700001853 Forwarde... 
[14:32:44] http://ganglia.wikimedia.org/latest/?r=hour&tab=ch&hreg[]=^db1055 [14:32:50] RECOVERY - HHVM queue size on mw1247 is OK: OK: Less than 30.00% above the threshold [10.0] [14:32:59] PROBLEM - HHVM queue size on mw1258 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [14:33:09] PROBLEM - HHVM queue size on mw1029 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [80.0] [14:33:09] RECOVERY - HHVM busy threads on mw1170 is OK: OK: Less than 30.00% above the threshold [76.8] [14:33:38] PROBLEM - HHVM busy threads on mw1243 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [14:33:40] PROBLEM - HHVM busy threads on mw1036 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [14:33:49] PROBLEM - HHVM busy threads on mw1242 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [14:33:49] PROBLEM - HHVM busy threads on mw1034 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [86.4] [14:33:59] PROBLEM - HHVM busy threads on mw1064 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [86.4] [14:34:03] s1 is in trouble [14:34:06] lots of queries [14:34:09] Heavey load? [14:34:09] PROBLEM - HHVM busy threads on mw1070 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [86.4] [14:34:16] I wonder why [14:34:18] yeah but it passed now so I don't know what it was [14:34:19] PROBLEM - HHVM queue size on mw1100 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [80.0] [14:34:28] PROBLEM - HHVM queue size on mw1055 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [80.0] [14:34:28] PROBLEM - HHVM busy threads on mw1216 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [115.2] [14:34:28] PROBLEM - HHVM busy threads on mw1055 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [86.4] [14:34:29] PROBLEM - HHVM queue size on mw1110 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [14:34:29] RECOVERY - HHVM busy threads on mw1176 is OK: OK: Less than 30.00% above the threshold [76.8] [14:34:30] PROBLEM - HHVM busy threads on mw1029 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [14:34:40] PROBLEM - HHVM busy threads on mw1241 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [14:34:59] PROBLEM - HHVM queue size on mw1025 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [80.0] [14:34:59] PROBLEM - HHVM queue size on mw1051 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [80.0] [14:34:59] PROBLEM - HHVM busy threads on mw1046 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [86.4] [14:35:09] PROBLEM - HHVM busy threads on mw1214 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [14:35:21] RECOVERY - HHVM queue size on mw1253 is OK: OK: Less than 30.00% above the threshold [10.0] [14:35:21] PROBLEM - HHVM queue size on mw1074 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [14:35:22] PROBLEM - HHVM queue size on mw1241 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [80.0] [14:35:22] PROBLEM - HHVM busy threads on mw1081 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [14:35:29] stupid checks [14:35:34] I'm cloning mediawiki core from gerrit hope that's not the issue [14:35:40] PROBLEM - HHVM busy threads on mw1110 is CRITICAL: CRITICAL: 33.33% of data above the 
critical threshold [86.4] [14:35:40] RECOVERY - HHVM busy threads on mw1243 is OK: OK: Less than 30.00% above the threshold [76.8] [14:35:43] no, that's not the issue mafk :) [14:35:48] PROBLEM - HHVM queue size on mw1050 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [14:35:49] PROBLEM - HHVM queue size on mw1242 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [14:35:49] PROBLEM - HHVM busy threads on mw1102 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [86.4] [14:35:49] PROBLEM - HHVM queue size on mw1036 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [14:35:49] RECOVERY - HHVM busy threads on mw1034 is OK: OK: Less than 30.00% above the threshold [57.6] [14:35:58] PROBLEM - HHVM queue size on mw1081 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [80.0] [14:35:58] PROBLEM - HHVM queue size on mw1022 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [14:35:58] PROBLEM - HHVM busy threads on mw1245 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [115.2] [14:35:58] PROBLEM - HHVM busy threads on mw1250 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [14:35:59] RECOVERY - HHVM busy threads on mw1064 is OK: OK: Less than 30.00% above the threshold [57.6] [14:36:09] PROBLEM - HHVM busy threads on mw1051 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [14:36:09] PROBLEM - HHVM queue size on mw1070 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [80.0] [14:36:18] PROBLEM - HHVM busy threads on mw1027 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [14:36:19] PROBLEM - HHVM busy threads on mw1104 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [14:36:29] PROBLEM - HHVM busy threads on mw1091 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [14:36:29] RECOVERY - HHVM busy threads on mw1216 is OK: OK: Less than 30.00% above the threshold [76.8] [14:36:29] PROBLEM - HHVM busy threads on mw1166 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [14:36:32] we had a centralnotice for this [14:36:40] PROBLEM - HHVM queue size on mw1188 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [80.0] [14:36:43] Centralnotice? 
[14:37:00] PROBLEM - HHVM queue size on mw1024 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [80.0] [14:37:08] RECOVERY - HHVM queue size on mw1025 is OK: OK: Less than 30.00% above the threshold [10.0] [14:37:08] RECOVERY - HHVM queue size on mw1258 is OK: OK: Less than 30.00% above the threshold [10.0] [14:37:09] RECOVERY - HHVM busy threads on mw1214 is OK: OK: Less than 30.00% above the threshold [76.8] [14:37:10] RECOVERY - HHVM queue size on mw1029 is OK: OK: Less than 30.00% above the threshold [10.0] [14:37:19] RECOVERY - HHVM queue size on mw1074 is OK: OK: Less than 30.00% above the threshold [10.0] [14:37:19] RECOVERY - HHVM busy threads on mw1081 is OK: OK: Less than 30.00% above the threshold [57.6] [14:37:40] RECOVERY - HHVM busy threads on mw1110 is OK: OK: Less than 30.00% above the threshold [57.6] [14:37:48] RECOVERY - HHVM queue size on mw1050 is OK: OK: Less than 30.00% above the threshold [10.0] [14:37:49] RECOVERY - HHVM queue size on mw1242 is OK: OK: Less than 30.00% above the threshold [10.0] [14:37:58] RECOVERY - HHVM queue size on mw1036 is OK: OK: Less than 30.00% above the threshold [10.0] [14:37:58] RECOVERY - HHVM busy threads on mw1242 is OK: OK: Less than 30.00% above the threshold [76.8] [14:37:58] RECOVERY - HHVM queue size on mw1022 is OK: OK: Less than 30.00% above the threshold [10.0] [14:37:59] RECOVERY - HHVM busy threads on mw1250 is OK: OK: Less than 30.00% above the threshold [76.8] [14:38:10] RECOVERY - HHVM queue size on mw1070 is OK: OK: Less than 30.00% above the threshold [10.0] [14:38:19] RECOVERY - HHVM busy threads on mw1070 is OK: OK: Less than 30.00% above the threshold [57.6] [14:38:28] RECOVERY - HHVM queue size on mw1100 is OK: OK: Less than 30.00% above the threshold [10.0] [14:38:29] RECOVERY - HHVM busy threads on mw1071 is OK: OK: Less than 30.00% above the threshold [57.6] [14:38:29] RECOVERY - HHVM queue size on mw1055 is OK: OK: Less than 30.00% above the threshold [10.0] [14:38:30] RECOVERY - HHVM busy threads on mw1091 is OK: OK: Less than 30.00% above the threshold [57.6] [14:38:30] RECOVERY - HHVM busy threads on mw1166 is OK: OK: Less than 30.00% above the threshold [76.8] [14:38:30] RECOVERY - HHVM busy threads on mw1055 is OK: OK: Less than 30.00% above the threshold [57.6] [14:38:38] RECOVERY - HHVM busy threads on mw1237 is OK: OK: Less than 30.00% above the threshold [76.8] [14:38:38] RECOVERY - HHVM queue size on mw1110 is OK: OK: Less than 30.00% above the threshold [10.0] [14:38:39] RECOVERY - HHVM queue size on mw1071 is OK: OK: Less than 30.00% above the threshold [10.0] [14:38:39] RECOVERY - HHVM busy threads on mw1029 is OK: OK: Less than 30.00% above the threshold [57.6] [14:38:48] RECOVERY - HHVM queue size on mw1188 is OK: OK: Less than 30.00% above the threshold [10.0] [14:38:49] RECOVERY - HHVM busy threads on mw1241 is OK: OK: Less than 30.00% above the threshold [76.8] [14:39:08] RECOVERY - HHVM queue size on mw1024 is OK: OK: Less than 30.00% above the threshold [10.0] [14:39:09] RECOVERY - HHVM queue size on mw1051 is OK: OK: Less than 30.00% above the threshold [10.0] [14:39:09] RECOVERY - HHVM busy threads on mw1046 is OK: OK: Less than 30.00% above the threshold [57.6] [14:39:19] Yikes [14:39:39] dewiki is down at the moment [14:39:48] Yep [14:39:51] Everything is down, Luke081515|away. 
[14:39:52] (I guess) [14:39:53] 6operations: Intermittent 503's on multiple sites - https://phabricator.wikimedia.org/T112463#1635297 (10faidon) Seems to have been a cascading failure: application servers were backed up waiting for databases which resulted into a full-outage. The underlying issue seems to have been a database overload across a... [14:39:56] oh, ok [14:39:56] down again? [14:39:59] RECOVERY - HHVM busy threads on mw1036 is OK: OK: Less than 30.00% above the threshold [57.6] [14:39:59] RECOVERY - HHVM busy threads on mw1102 is OK: OK: Less than 30.00% above the threshold [57.6] [14:40:00] RECOVERY - HHVM queue size on mw1081 is OK: OK: Less than 30.00% above the threshold [10.0] [14:40:00] RECOVERY - HHVM busy threads on mw1245 is OK: OK: Less than 30.00% above the threshold [76.8] [14:40:18] RECOVERY - HHVM busy threads on mw1051 is OK: OK: Less than 30.00% above the threshold [57.6] [14:40:20] RECOVERY - HHVM busy threads on mw1027 is OK: OK: Less than 30.00% above the threshold [57.6] [14:40:28] RECOVERY - HHVM busy threads on mw1104 is OK: OK: Less than 30.00% above the threshold [57.6] [14:41:03] So is it a bug, hardware or a 'slam'? [14:41:08] PROBLEM - HHVM rendering on mw1257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:09] PROBLEM - Apache HTTP on mw1052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:09] PROBLEM - HHVM rendering on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:09] PROBLEM - HHVM rendering on mw1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:09] PROBLEM - HHVM rendering on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:09] PROBLEM - HHVM rendering on mw1067 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:09] PROBLEM - HHVM rendering on mw1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:10] PROBLEM - HHVM rendering on mw1065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:10] PROBLEM - HHVM rendering on mw1108 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:11] PROBLEM - HHVM rendering on mw1086 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:11] PROBLEM - HHVM rendering on mw1101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:12] PROBLEM - HHVM rendering on mw1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:12] PROBLEM - HHVM rendering on mw1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:13] PROBLEM - HHVM rendering on mw1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:14] oh great [14:41:31] PROBLEM - HHVM rendering on mw1250 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:31] RECOVERY - HHVM queue size on mw1241 is OK: OK: Less than 30.00% above the threshold [10.0] [14:41:39] o.O [14:41:41] be calm, everything will be solved [14:41:52] at the moment, dewiki works again [14:42:02] vandals for now can't edit [14:42:03] mafk: The high load looks unusual [14:42:32] ShakespeareFan00: I don't know, I hardly understand, but I trust our ops team. 
[14:42:58] RECOVERY - HHVM rendering on mw1081 is OK: HTTP OK: HTTP/1.1 200 OK - 65407 bytes in 0.141 second response time [14:42:59] RECOVERY - Apache HTTP on mw1052 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.052 second response time [14:42:59] RECOVERY - HHVM rendering on mw1067 is OK: HTTP OK: HTTP/1.1 200 OK - 65407 bytes in 0.142 second response time [14:42:59] RECOVERY - HHVM rendering on mw1065 is OK: HTTP OK: HTTP/1.1 200 OK - 65407 bytes in 0.125 second response time [14:42:59] RECOVERY - HHVM rendering on mw1101 is OK: HTTP OK: HTTP/1.1 200 OK - 65407 bytes in 0.172 second response time [14:43:00] RECOVERY - HHVM rendering on mw1057 is OK: HTTP OK: HTTP/1.1 200 OK - 65407 bytes in 0.151 second response time [14:43:00] RECOVERY - HHVM rendering on mw1108 is OK: HTTP OK: HTTP/1.1 200 OK - 65407 bytes in 0.147 second response time [14:43:00] RECOVERY - HHVM rendering on mw1086 is OK: HTTP OK: HTTP/1.1 200 OK - 65406 bytes in 0.139 second response time [14:43:00] RECOVERY - HHVM rendering on mw1257 is OK: HTTP OK: HTTP/1.1 200 OK - 65406 bytes in 0.133 second response time [14:43:01] RECOVERY - HHVM rendering on mw1150 is OK: HTTP OK: HTTP/1.1 200 OK - 65407 bytes in 0.162 second response time [14:43:01] RECOVERY - HHVM rendering on mw1044 is OK: HTTP OK: HTTP/1.1 200 OK - 65407 bytes in 0.124 second response time [14:43:02] RECOVERY - HHVM rendering on mw1025 is OK: HTTP OK: HTTP/1.1 200 OK - 65407 bytes in 0.242 second response time [14:43:02] RECOVERY - HHVM rendering on mw1075 is OK: HTTP OK: HTTP/1.1 200 OK - 65407 bytes in 0.251 second response time [14:43:03] RECOVERY - HHVM rendering on mw1054 is OK: HTTP OK: HTTP/1.1 200 OK - 65407 bytes in 0.131 second response time [14:43:08] argh [14:43:13] See, everything is fine [14:43:18] RECOVERY - Apache HTTP on mw1246 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.027 second response time [14:43:20] RECOVERY - HHVM rendering on mw1250 is OK: HTTP OK: HTTP/1.1 200 OK - 65406 bytes in 0.103 second response time [14:43:21] we know there's a problem, I'm investigating [14:44:12] Bsadowski1: be more careful next time with the switches :P [14:44:49] :P [14:44:59] But I'm at home [14:45:18] PROBLEM - HHVM busy threads on mw1244 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [115.2] [14:45:29] PROBLEM - HHVM queue size on mw1253 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [80.0] [14:46:29] PROBLEM - HHVM queue size on mw1093 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [14:46:39] PROBLEM - HHVM queue size on mw1063 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [14:46:39] PROBLEM - HHVM queue size on mw1239 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [80.0] [14:46:49] PROBLEM - HHVM busy threads on mw1073 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [14:46:50] PROBLEM - HHVM queue size on mw1042 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [14:46:50] PROBLEM - HHVM queue size on mw1149 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [80.0] [14:46:59] PROBLEM - HHVM queue size on mw1088 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [14:46:59] PROBLEM - HHVM queue size on mw1039 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [80.0] [14:47:18] PROBLEM - HHVM busy threads on mw1083 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [86.4] [14:47:29] 
PROBLEM - HHVM queue size on mw1085 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [14:47:29] PROBLEM - HHVM queue size on mw1082 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [80.0] [14:47:29] PROBLEM - HHVM queue size on mw1214 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [14:47:39] PROBLEM - HHVM queue size on mw1072 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [80.0] [14:47:58] PROBLEM - HHVM busy threads on mw1243 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [115.2] [14:47:58] PROBLEM - HHVM busy threads on mw1054 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [14:47:59] PROBLEM - HHVM queue size on mw1113 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [14:47:59] PROBLEM - HHVM queue size on mw1103 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [80.0] [14:48:09] PROBLEM - HHVM busy threads on mw1246 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [14:48:18] PROBLEM - HHVM busy threads on mw1256 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [115.2] [14:48:18] PROBLEM - HHVM busy threads on mw1106 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [14:48:18] PROBLEM - HHVM busy threads on mw1064 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [14:48:28] PROBLEM - HHVM busy threads on mw1107 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [86.4] [14:48:28] PROBLEM - HHVM busy threads on mw1093 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [86.4] [14:48:28] PROBLEM - HHVM queue size on mw1168 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [80.0] [14:48:29] PROBLEM - HHVM busy threads on mw1098 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [14:48:39] PROBLEM - HHVM busy threads on mw1065 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [14:48:39] PROBLEM - HHVM queue size on mw1073 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [80.0] [14:48:49] PROBLEM - HHVM queue size on mw1171 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [14:48:59] PROBLEM - HHVM queue size on mw1092 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [80.0] [14:49:00] PROBLEM - HHVM queue size on mw1188 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [80.0] [14:49:00] PROBLEM - HHVM queue size on mw1151 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [14:49:18] RECOVERY - HHVM busy threads on mw1083 is OK: OK: Less than 30.00% above the threshold [57.6] [14:49:18] RECOVERY - HHVM busy threads on mw1244 is OK: OK: Less than 30.00% above the threshold [76.8] [14:49:29] PROBLEM - HHVM busy threads on mw1039 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [14:49:30] PROBLEM - HHVM queue size on mw1018 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [80.0] [14:49:30] RECOVERY - HHVM queue size on mw1085 is OK: OK: Less than 30.00% above the threshold [10.0] [14:49:30] RECOVERY - HHVM queue size on mw1082 is OK: OK: Less than 30.00% above the threshold [10.0] [14:49:38] RECOVERY - HHVM queue size on mw1253 is OK: OK: Less than 30.00% above the threshold [10.0] [14:49:38] RECOVERY - HHVM queue size on mw1214 is OK: OK: Less than 30.00% above the threshold [10.0] [14:49:40] 
PROBLEM - HHVM busy threads on mw1108 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [86.4] [14:49:40] RECOVERY - HHVM queue size on mw1072 is OK: OK: Less than 30.00% above the threshold [10.0] [14:49:59] RECOVERY - HHVM busy threads on mw1243 is OK: OK: Less than 30.00% above the threshold [76.8] [14:49:59] RECOVERY - HHVM busy threads on mw1054 is OK: OK: Less than 30.00% above the threshold [57.6] [14:49:59] RECOVERY - HHVM queue size on mw1113 is OK: OK: Less than 30.00% above the threshold [10.0] [14:49:59] RECOVERY - HHVM queue size on mw1103 is OK: OK: Less than 30.00% above the threshold [10.0] [14:50:10] RECOVERY - HHVM busy threads on mw1246 is OK: OK: Less than 30.00% above the threshold [76.8] [14:50:18] RECOVERY - HHVM busy threads on mw1106 is OK: OK: Less than 30.00% above the threshold [57.6] [14:50:19] RECOVERY - HHVM busy threads on mw1256 is OK: OK: Less than 30.00% above the threshold [76.8] [14:50:28] RECOVERY - HHVM busy threads on mw1107 is OK: OK: Less than 30.00% above the threshold [57.6] [14:50:31] RECOVERY - HHVM busy threads on mw1093 is OK: OK: Less than 30.00% above the threshold [57.6] [14:50:31] RECOVERY - HHVM queue size on mw1168 is OK: OK: Less than 30.00% above the threshold [10.0] [14:50:31] RECOVERY - HHVM queue size on mw1093 is OK: OK: Less than 30.00% above the threshold [10.0] [14:50:31] RECOVERY - HHVM busy threads on mw1098 is OK: OK: Less than 30.00% above the threshold [57.6] [14:50:39] RECOVERY - HHVM busy threads on mw1065 is OK: OK: Less than 30.00% above the threshold [57.6] [14:50:40] RECOVERY - HHVM queue size on mw1063 is OK: OK: Less than 30.00% above the threshold [10.0] [14:50:40] RECOVERY - HHVM queue size on mw1073 is OK: OK: Less than 30.00% above the threshold [10.0] [14:50:40] RECOVERY - HHVM queue size on mw1239 is OK: OK: Less than 30.00% above the threshold [10.0] [14:50:50] RECOVERY - HHVM queue size on mw1171 is OK: OK: Less than 30.00% above the threshold [10.0] [14:50:58] RECOVERY - HHVM busy threads on mw1073 is OK: OK: Less than 30.00% above the threshold [57.6] [14:50:59] RECOVERY - HHVM queue size on mw1042 is OK: OK: Less than 30.00% above the threshold [10.0] [14:50:59] RECOVERY - HHVM queue size on mw1149 is OK: OK: Less than 30.00% above the threshold [10.0] [14:50:59] RECOVERY - HHVM queue size on mw1188 is OK: OK: Less than 30.00% above the threshold [10.0] [14:50:59] RECOVERY - HHVM queue size on mw1092 is OK: OK: Less than 30.00% above the threshold [10.0] [14:51:00] RECOVERY - HHVM queue size on mw1151 is OK: OK: Less than 30.00% above the threshold [10.0] [14:51:08] RECOVERY - HHVM queue size on mw1039 is OK: OK: Less than 30.00% above the threshold [10.0] [14:51:08] RECOVERY - HHVM queue size on mw1088 is OK: OK: Less than 30.00% above the threshold [10.0] [14:51:29] RECOVERY - HHVM queue size on mw1018 is OK: OK: Less than 30.00% above the threshold [10.0] [14:51:41] RECOVERY - HHVM busy threads on mw1108 is OK: OK: Less than 30.00% above the threshold [57.6] [14:52:19] RECOVERY - HHVM busy threads on mw1064 is OK: OK: Less than 30.00% above the threshold [57.6] [14:53:29] RECOVERY - HHVM busy threads on mw1039 is OK: OK: Less than 30.00% above the threshold [57.6] [14:53:34] "mwoauthdatastore-callback-not-found" [14:57:09] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:09:35] Things seemingly a bit more stable? 
[15:09:43] 6operations: Intermittent 503's on multiple sites - https://phabricator.wikimedia.org/T112463#1635329 (10faidon) 5Open>3Resolved a:3faidon The cause was determined to be an attack (in three bursts) on our servers, in a successful attempt to overload them. I won't say more — we have a policy on not document... [15:12:21] paravoid: any account you wish to have globally locked or IPs globally blocked? [15:12:40] no, thanks [15:12:53] ok, please ping if you change your mind :) [15:13:01] and thanks for resolving this [15:13:55] mafk: So what happened? I appreciate there may not be a lot you can say for operational reasons [15:14:11] ShakespeareFan00: I know as much as you. [15:14:20] --> afk [15:15:23] 6operations, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1635337 (10Aklapper) Well, if it's 2.40.10 it'll also fix T112421 and T97758 [15:16:24] 6operations, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, 6Services, 5ContentTranslation-Release6: Apertium Failed to load resource: net::ERR_SPDY_PROTOCOL_ERROR - https://phabricator.wikimedia.org/T112403#1635339 (10Oscar) 5Open>3Resolved a:3Oscar It was the webshield... [15:17:41] ^ Avast seems to have issues [15:18:07] I've logged a bug in Mozilla's Bugzilla concerning its apparent incompatibility with Firefox [15:18:16] 6operations: Intermittent 503's on multiple sites - https://phabricator.wikimedia.org/T112463#1635345 (10Luke081515) @faidon Thanks for quick resolving [15:18:52] The view I got in #firefox about Avast was that it was also causing problems for Firefox support people as well [15:20:50] Might be worth looking into that because an antivirus shouldn't break normal use ;) [15:20:53] * ShakespeareFan00 out [15:34:04] 6operations, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1635353 (10Ricordisamoa) >>! In T112421#1635337, @Aklapper wrote: > Well, if it's 2.40.10 it'll also fix T112421 and T97758 The first one is this one. [15:46:59] (03PS1) 10Andrew Bogott: Bump disk_allocation_ratio up a bit more. [puppet] - 10https://gerrit.wikimedia.org/r/237995 (https://phabricator.wikimedia.org/T111988) [15:48:02] (03CR) 10Andrew Bogott: [C: 032] Bump disk_allocation_ratio up a bit more. [puppet] - 10https://gerrit.wikimedia.org/r/237995 (https://phabricator.wikimedia.org/T111988) (owner: 10Andrew Bogott) [15:56:31] paravoid: DDOS season started again about two weeks ago :-( [16:00:07] (03PS1) 10Faidon Liambotis: mediawiki: kill HHVM graphite checks [puppet] - 10https://gerrit.wikimedia.org/r/237998 [16:00:25] (03PS2) 10Faidon Liambotis: mediawiki: kill HHVM graphite checks [puppet] - 10https://gerrit.wikimedia.org/r/237998 [16:12:29] PROBLEM - puppet last run on mw2086 is CRITICAL: CRITICAL: puppet fail [16:26:19] (03CR) 10Brian Wolff: [C: 031] Add *.ggpht.com to Wikimedia Commons upload whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234980 (https://phabricator.wikimedia.org/T110869) (owner: 10Dereckson) [16:38:48] RECOVERY - puppet last run on mw2086 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [16:50:35] 6operations, 10Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1635456 (10Jalexander) What's the planned timeline for the migration/change? Still planned for this coming week? 
[16:52:41] 6operations, 10Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1635457 (10JohnLewis) >>! In T110949#1635456, @Jalexander wrote: > What's the planned timeline for the migration/change? Still planned for this coming week? To my knowledge, we've still not resch... [17:06:17] 6operations, 7Database: Better mysql monitoring for number of connections and processlist strange patterns - https://phabricator.wikimedia.org/T112473#1635469 (10jcrespo) 3NEW a:3jcrespo [17:06:26] 6operations, 7Database: Better mysql monitoring for number of connections and processlist strange patterns - https://phabricator.wikimedia.org/T112473#1635477 (10jcrespo) p:5Triage>3High [17:11:10] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 86 data above and 8 below the confidence bounds [17:17:29] PROBLEM - puppet last run on mw2010 is CRITICAL: CRITICAL: puppet fail [17:45:58] RECOVERY - puppet last run on mw2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:48:30] 6operations, 7Database: Upgrade db1055 mysql version and configuration, and reduce its pool weight - https://phabricator.wikimedia.org/T112478#1635580 (10jcrespo) 3NEW a:3jcrespo [17:52:04] 6operations, 7Database: Check, test and tune pool-of-connections and max_connections configuration - https://phabricator.wikimedia.org/T112479#1635595 (10jcrespo) 3NEW a:3jcrespo [18:08:11] ori, please disable https://phabricator.wikimedia.org/p/Jmiguel2902/ [18:09:15] user is blocked as a vandalism-only account on mediawiki.org too [18:12:20] 6operations, 6Release-Engineering-Team: tin disk space at 5% - https://phabricator.wikimedia.org/T112391#1635641 (10jcrespo) Original report of <6% disk space alert is gone. I am retiring operations from the projects (sadly, we do not have a "Done" column), and let tin's users full control of it (close the tic... [18:21:23] 6operations, 7Database: Better mysql monitoring for number of connections and processlist strange patterns - https://phabricator.wikimedia.org/T112473#1635662 (10jcrespo) In particular, not only should we monitor the current connections, but the peak since the last check, as otherwise it may be undetected for... [18:47:18] RECOVERY - Disk space on labstore1002 is OK: DISK OK [18:49:33] Krenair: k, done [18:49:46] thanks [19:21:49] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: puppet fail [19:48:09] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [21:03:43] 6operations: determine nik everett's shell/production access levels - https://phabricator.wikimedia.org/T109390#1635839 (10Aklapper) @RobH: So... what's left to do here? (Task is still open but last comment says "just reopen this task".) [21:08:18] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [21:08:33] 6operations: audit hr staff and tracking sheet (2015-08-17 revision) against shell access/ldap wmf group - https://phabricator.wikimedia.org/T109382#1635845 (10JohnLewis) [21:08:35] 6operations: determine nik everett's shell/production access levels - https://phabricator.wikimedia.org/T109390#1635843 (10JohnLewis) 5Open>3Resolved @aklapper nothing. seems he just forgot to resolve the task. 
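Regarding T112473 above (better MySQL monitoring) and the follow-up note that the check should report the peak number of connections since the last check rather than only the current value: a minimal sketch of that idea follows. It is an illustration under stated assumptions, not the actual monitoring code; `sample_threads_connected()` is a hypothetical stand-in for querying `SHOW GLOBAL STATUS LIKE 'Threads_connected'` on the server.

```python
"""Sketch of the peak-tracking idea from T112473 above: report the highest
connection count seen since the previous check, so short bursts between check
intervals are not missed. Names and thresholds here are hypothetical."""

import random
import time

def sample_threads_connected():
    # Hypothetical stand-in for asking the MySQL server for Threads_connected.
    return random.randint(50, 5000)

class PeakTrackingCheck:
    def __init__(self):
        self._peak = 0

    def sample(self):
        # Called frequently (e.g. every few seconds) between checks.
        self._peak = max(self._peak, sample_threads_connected())

    def check(self, warn=2000, crit=4000):
        # Called at the normal check interval; resets the window afterwards.
        peak, self._peak = self._peak, 0
        if peak >= crit:
            return 2, f"CRITICAL: peak connections {peak} since last check"
        if peak >= warn:
            return 1, f"WARNING: peak connections {peak} since last check"
        return 0, f"OK: peak connections {peak} since last check"

if __name__ == "__main__":
    checker = PeakTrackingCheck()
    for _ in range(10):
        checker.sample()
        time.sleep(0.01)
    status, message = checker.check()
    print(status, message)
```

The sampler runs far more often than the check itself, so a spike that lasts only a few seconds still shows up in the peak reported at the next check.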
[21:10:19] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10772 bytes in 0.334 second response time [21:20:35] (03PS1) 10Ori.livneh: HHVM: enable stats collection for MySQL usage on canary servers [puppet] - 10https://gerrit.wikimedia.org/r/238073 [21:21:34] (03CR) 10Ori.livneh: [C: 032 V: 032] HHVM: enable stats collection for MySQL usage on canary servers [puppet] - 10https://gerrit.wikimedia.org/r/238073 (owner: 10Ori.livneh) [21:23:51] 6operations, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1635868 (10Tgr) [21:24:06] 6operations, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1634588 (10Tgr) [21:54:05] !bash I spotted another one of these: Criminal Minds, "The Hunt", at timestamp 26:10 on Netflix. Apparently the NewPP limit report is part of the login process for some underground girl-auctioning site. [22:44:06] (03CR) 10Ori.livneh: [C: 031] "Good catch." [debs/pybal] - 10https://gerrit.wikimedia.org/r/237986 (https://phabricator.wikimedia.org/T112457) (owner: 10Faidon Liambotis)