[00:03:54] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL check_failover servers up 2 down 1
[01:28:13] PROBLEM - puppet last run on mw2020 is CRITICAL puppet fail
[01:52:20] operations, Traffic, fundraising-tech-ops, Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#1522032 (BBlack) I think the problems with email.donate run deeper than anything IBM could fix without changes on our side as well....
[01:55:10] operations, Traffic, fundraising-tech-ops, Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#1522035 (BBlack) If the solution is to convert all of the links to foo.mkt4988.com, I think that's still a pretty poor solution (it...
[01:56:04] RECOVERY - puppet last run on mw2020 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:20:01] !log l10nupdate Synchronized php-1.26wmf17/cache/l10n: l10nupdate for 1.26wmf17 (duration: 06m 32s)
[02:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:23:22] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf17) at 2015-08-09 02:23:22+00:00
[02:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:50:33] PROBLEM - BGP status on cr2-eqiad is CRITICAL host 208.80.154.197, sessions up: 73, down: 3, shutdown: 0; Peering with AS1273 not established - CW; Peering with AS8218 not established - NEO-ASN; Peering with AS62651 not established - BR
[03:32:23] PROBLEM - BGP status on cr2-eqiad is CRITICAL host 208.80.154.197, sessions up: 73, down: 3, shutdown: 0; Peering with AS1273 not established - CW; Peering with AS8218 not established - NEO-ASN; Peering with AS62651 not established - BR
[03:35:34] PROBLEM - puppet last run on mw2187 is CRITICAL puppet fail
[03:36:23] PROBLEM - BGP status on cr2-eqiad is CRITICAL host 208.80.154.197, sessions up: 73, down: 3, shutdown: 0; Peering with AS1273 not established - CW; Peering with AS8218 not established - NEO-ASN; Peering with AS62651 not established - BR
[03:36:23] PROBLEM - puppet last run on mw2181 is CRITICAL Puppet has 1 failures
[03:37:04] PROBLEM - puppet last run on mw1050 is CRITICAL Puppet has 1 failures
[04:02:43] RECOVERY - puppet last run on mw1050 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:03:54] RECOVERY - puppet last run on mw2181 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:05:05] RECOVERY - puppet last run on mw2187 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:38:09] (PS1) Yuvipanda: aptly: Add module + simple role [puppet] - https://gerrit.wikimedia.org/r/230375 (https://phabricator.wikimedia.org/T104194)
[04:38:23] (PS2) Yuvipanda: aptly: Add module + simple role [puppet] - https://gerrit.wikimedia.org/r/230375 (https://phabricator.wikimedia.org/T104194)
[04:39:26] (PS1) Tim Landscheidt: Tools: Puppetize that gridengine resource h_vmem is consumable [puppet] - https://gerrit.wikimedia.org/r/230376 (https://phabricator.wikimedia.org/T107821)
[05:06:33] (CR) Yuvipanda: [C: 2] aptly: Add module + simple role [puppet] - https://gerrit.wikimedia.org/r/230375 (https://phabricator.wikimedia.org/T104194) (owner: Yuvipanda)
[05:09:48] (PS1) Yuvipanda: aptly: Actually include the module in the role [puppet] - https://gerrit.wikimedia.org/r/230379
[05:09:53] (CR) jenkins-bot: [V: -1] aptly: Actually include the module in the role [puppet] - https://gerrit.wikimedia.org/r/230379 (owner: Yuvipanda)
[05:10:00] (PS2) Yuvipanda: aptly: Actually include the module in the role [puppet] - https://gerrit.wikimedia.org/r/230379
[05:10:08] (CR) Yuvipanda: [C: 2 V: 2] aptly: Actually include the module in the role [puppet] - https://gerrit.wikimedia.org/r/230379 (owner: Yuvipanda)
[05:12:23] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Aug 9 05:12:23 UTC 2015 (duration 12m 22s)
[05:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:08:33] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string Wikimedia and MediaWiki not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 1764 bytes in 0.876 second response time
[06:08:46] > Failed to create a temporary directory: the disk is full.
[06:09:18] hmm
[06:12:24] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 21128 bytes in 0.239 second response time
[06:13:06] I tossed a phabricator access log but those logs should get rotated more often
[06:13:29] with a size limit too
[06:14:38] why would pacct logs be large? hmm
[06:15:54] RECOVERY - Disk space on iridium is OK: DISK OK
[06:16:11] operations: rotate phab access logs more often on iridium - https://phabricator.wikimedia.org/T108503#1522159 (ArielGlenn) NEW
[06:16:40] yep that's the other issue.
[06:16:52] between those two the disk gets filled
[06:17:17] I can't stick around this morning, but I'll check the scrollback later
[06:17:19] !log moved some log files on iridium into /srv/logs to free space on /
[06:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:17:51] the root partition is awfully small
[06:17:56] ridiculously small really
[06:17:56] yep
[06:18:05] we could just log to /srv/something
[06:20:19] !log restarted phabricator phd (just in case - the full partition may have caused the daemons to be in a broken state)
[06:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:22:40] yawn
[06:23:12] my luck, wake up and 1 minute later (literally) get a page. anyways thanks for doing that
[06:30:54] PROBLEM - puppet last run on cp2013 is CRITICAL Puppet has 1 failures
[06:31:23] PROBLEM - puppet last run on wtp2017 is CRITICAL Puppet has 1 failures
[06:31:45] PROBLEM - puppet last run on mw2016 is CRITICAL Puppet has 1 failures
[06:31:54] PROBLEM - puppet last run on mc1017 is CRITICAL Puppet has 2 failures
[06:33:14] PROBLEM - puppet last run on db2064 is CRITICAL Puppet has 1 failures
[06:33:24] PROBLEM - puppet last run on mw2158 is CRITICAL Puppet has 1 failures
[06:33:44] PROBLEM - puppet last run on mw2207 is CRITICAL Puppet has 1 failures
[06:33:44] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 1 failures
[06:34:04] PROBLEM - puppet last run on mw2129 is CRITICAL Puppet has 1 failures
[06:34:04] PROBLEM - puppet last run on mw2073 is CRITICAL Puppet has 1 failures
[06:42:59] (PS1) 20after4: Rotate apache2 logs more often on phabricator web host [puppet] - https://gerrit.wikimedia.org/r/230382
[06:43:29] apergos: ^
[06:44:27] (PS2) 20after4: Rotate apache2 logs more often on phabricator web host [puppet] - https://gerrit.wikimedia.org/r/230382
[06:46:25] operations: rotate phab access logs more often on iridium - https://phabricator.wikimedia.org/T108503#1522175 (mmodell) https://gerrit.wikimedia.org/r/#/c/230382/ will rotate the apache logs more aggressively.
[06:46:43] (PS3) 20after4: Rotate apache2 logs more often on phabricator web host [puppet] - https://gerrit.wikimedia.org/r/230382 (https://phabricator.wikimedia.org/T108503)
[06:55:14] RECOVERY - puppet last run on db2064 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:55:53] RECOVERY - puppet last run on mc1017 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:56:53] RECOVERY - puppet last run on cp2013 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:57:23] RECOVERY - puppet last run on wtp2017 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:23] RECOVERY - puppet last run on mw2158 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:57:43] RECOVERY - puppet last run on mw2207 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:43] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:57:44] RECOVERY - puppet last run on mw2016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:04] RECOVERY - puppet last run on mw2129 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:04] RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:04:24] PROBLEM - puppet last run on mw1035 is CRITICAL Puppet has 1 failures
[07:20:04] PROBLEM - puppet last run on ruthenium is CRITICAL Puppet has 25 failures
[07:30:13] RECOVERY - puppet last run on mw1035 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures
[07:34:53] RECOVERY - haproxy failover on dbproxy1003 is OK check_failover servers up 2 down 0
[07:53:19] operations, Analytics-Engineering: Honor DNT header for access logs & varnish logs - https://phabricator.wikimedia.org/T98831#1522213 (Tgr) EFF just published a [[ https://www.eff.org/dnt-policy | DNT policy template ]] which sets out expectations for the behavior of DNT-friendly sites. For web server logs...
[08:01:53] (Draft2) Giuseppe Lavagetto: apache: allow tuning of logrotate parameters [puppet] - https://gerrit.wikimedia.org/r/230384 (https://phabricator.wikimedia.org/T108503)
[08:03:14] (CR) Giuseppe Lavagetto: [C: -2] "I hate when we overwrite system-provided files just to change one line in them." [puppet] - https://gerrit.wikimedia.org/r/230382 (https://phabricator.wikimedia.org/T108503) (owner: 20after4)
[08:03:57] <_joe_> apergos: since twentyafterfour was referring to you for that change, please look at what I did instead :)
[08:15:54] RECOVERY - puppet last run on ruthenium is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures
[08:23:40] operations, hardware-requests, Database: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1522236 (jcrespo) Some nodes, like es1009 are down to 6% available disk space.
[08:36:20] hi m4tx
[08:45:33] Blocked-on-Operations, operations, Commons, Multimedia, and 7 others: Convert eqiad imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1522248 (McZusatz)
[09:23:51] operations: Test and fix db1047 BBU - https://phabricator.wikimedia.org/T103345#1522304 (Peachey88)
[10:30:24] (PS1) Amire80: Add Wikimedia Australia blog to the English Planet [puppet] - https://gerrit.wikimedia.org/r/230391
[10:31:17] (PS2) Amire80: Add Wikimedia Australia blog to the English Planet [puppet] - https://gerrit.wikimedia.org/r/230391
[11:23:34] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 726.230263349
[11:45:09] (CR) Glaisher: [C: 1] Add Wikimedia Australia blog to the English Planet [puppet] - https://gerrit.wikimedia.org/r/230391 (owner: Amire80)
[11:53:04] PROBLEM - very high load average likely xfs on ms-be2006 is CRITICAL - load average: 103.83, 101.02, 97.70
[12:07:14] PROBLEM - very high load average likely xfs on ms-be2006 is CRITICAL - load average: 99.63, 100.11, 98.71
[12:17:13] PROBLEM - very high load average likely xfs on ms-be2006 is CRITICAL - load average: 107.87, 101.23, 99.35
[12:20:48] hi Negative24
[12:20:53] and Nemo_bis
[12:45:07] * hoo wonders whether ori is here
[12:53:58] (PS1) Hoo man: Set dispatchBatchChunkFactor to 10 for now [mediawiki-config] - https://gerrit.wikimedia.org/r/230401
[12:55:22] (CR) Hoo man: [C: 2] ":/" [mediawiki-config] - https://gerrit.wikimedia.org/r/230401 (owner: Hoo man)
[12:55:27] (Merged) jenkins-bot: Set dispatchBatchChunkFactor to 10 for now [mediawiki-config] - https://gerrit.wikimedia.org/r/230401 (owner: Hoo man)
[12:56:41] !log hoo Synchronized php-1.26wmf17/extensions/Wikidata/: Set dispatchBatchChunkFactor to 10 for now (duration: 00m 20s)
[12:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:25:24] PROBLEM - very high load average likely xfs on ms-be2006 is CRITICAL - load average: 104.03, 101.28, 98.65
[13:33:34] PROBLEM - very high load average likely xfs on ms-be2006 is CRITICAL - load average: 104.57, 101.70, 99.47
[13:41:34] PROBLEM - very high load average likely xfs on ms-be2006 is CRITICAL - load average: 103.91, 100.60, 99.48
[13:47:34] PROBLEM - very high load average likely xfs on ms-be2006 is CRITICAL - load average: 107.55, 101.91, 100.19
[14:07:06] (CR) 20after4: [C: 1] apache: allow tuning of logrotate parameters [puppet] - https://gerrit.wikimedia.org/r/230384 (https://phabricator.wikimedia.org/T108503) (owner: Giuseppe Lavagetto)
[14:07:44] PROBLEM - very high load average likely xfs on ms-be2006 is CRITICAL - load average: 105.06, 100.32, 100.03
[14:10:00] (CR) 20after4: "This is much better than the way I did it :)" [puppet] - https://gerrit.wikimedia.org/r/230384 (https://phabricator.wikimedia.org/T108503) (owner: Giuseppe Lavagetto)
[14:57:48] m4tx: hello
[15:33:16] operations, MediaWiki-Database: Compress data at external storage - https://phabricator.wikimedia.org/T106386#1522688 (Krenair) I think trackBlobs would currently fail the initial integrity check (at least on dewiki and probably other wikis which are old enough) due to the presence of HistoryBlobStub obje...
[15:56:18] operations, MediaWiki-Database: Compress data at external storage - https://phabricator.wikimedia.org/T106386#1522705 (Krenair) ```mysql:wikiadmin@db2038.codfw.wmnet [dewiki]> select count(*) from text where old_flags LIKE '%object%' AND old_flags NOT LIKE '%external%' AND LOWER(CONVERT(LEFT(old_text,22)...
[17:01:54] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1412 bytes in 0.315 second response time
[17:05:04] PROBLEM - puppet last run on elastic1014 is CRITICAL puppet fail
[17:28:24] PROBLEM - Outgoing network saturation on labstore1002 is CRITICAL 25.93% of data above the critical threshold [100000000.0]
[17:31:03] RECOVERY - puppet last run on elastic1014 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures
[17:57:39] !log issuing nodetool cleanup on restbase1006
[17:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:58:24] RECOVERY - Outgoing network saturation on labstore1002 is OK Less than 10.00% above the threshold [75000000.0]
[18:10:27] operations: Photos of Servers - https://phabricator.wikimedia.org/T94694#1522904 (Steinsplitter) >>! In T94694#1434282, @VictorGrigas wrote: > @Steinsplitter - Yes, actually I'll be photographing (and hopefully shooting video of) the servers on Dallas on the way back from Wikimania. I'll have that media onli...
[18:13:23] PROBLEM - Disk space on iridium is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=84%)
[18:14:13] operations, network: Set up cr1-eqord & cr1-eqdfw - https://phabricator.wikimedia.org/T89227#1522913 (faidon) These are now set up. They are not connected yet (and thus, they're untested). @papaul and myself are going to do that this week, before we can consider this task done.
[19:09:29] (PS2) Tim Landscheidt: Tools: Puppetize that gridengine resource h_vmem is consumable [puppet] - https://gerrit.wikimedia.org/r/230376 (https://phabricator.wikimedia.org/T107821)
[19:11:32] (Abandoned) Merlijn van Deen: [WIP] [tools/apt] Multiversioning for Python 2/3 [puppet] - https://gerrit.wikimedia.org/r/221457 (owner: Merlijn van Deen)
[19:16:24] (CR) Tim Landscheidt: "Tested (and currently cherry-picked) on Toolsbeta." [puppet] - https://gerrit.wikimedia.org/r/230376 (https://phabricator.wikimedia.org/T107821) (owner: Tim Landscheidt)
[19:29:18] (CR) Tim Landscheidt: [C: -1] "And I've only now noticed that there is logic for some of that in gridengine::master, but a) that is an echo currently and b) even that ec" [puppet] - https://gerrit.wikimedia.org/r/230376 (https://phabricator.wikimedia.org/T107821) (owner: Tim Landscheidt)
[19:29:42] (CR) Merlijn van Deen: [C: 1] "looks good to me." [puppet] - https://gerrit.wikimedia.org/r/230376 (https://phabricator.wikimedia.org/T107821) (owner: Tim Landscheidt)
[20:14:54] PROBLEM - puppet last run on mw1190 is CRITICAL Puppet has 1 failures
[20:16:01] (PS1) BBlack: Release 1.9.3-1+wmf2 (newer multicert) [software/nginx] (wmf-1.9.3-1) - https://gerrit.wikimedia.org/r/230469
[20:17:56] (CR) Tim Landscheidt: "@valhallasw: If it doesn't (not possible for h_vmem as it required by SGE), you would have to supply *all* attributes, which would cause P" [puppet] - https://gerrit.wikimedia.org/r/230376 (https://phabricator.wikimedia.org/T107821) (owner: Tim Landscheidt)
[20:34:10] (CR) BBlack: [C: -1] "This needs a follow-on patch to update the stream proxy and ssl modules for s/ngx_ssl_certificate/ngx_ssl_certificates/ and related bits" [software/nginx] (wmf-1.9.3-1) - https://gerrit.wikimedia.org/r/230469 (owner: BBlack)
[20:41:05] RECOVERY - puppet last run on mw1190 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures
[20:53:35] (PS3) Tim Landscheidt: Tools: Puppetize gridengine complex configuration [puppet] - https://gerrit.wikimedia.org/r/230376 (https://phabricator.wikimedia.org/T107821)
[21:01:11] (PS4) Tim Landscheidt: Tools: Puppetize gridengine complex configuration [puppet] - https://gerrit.wikimedia.org/r/230376 (https://phabricator.wikimedia.org/T107821)
[21:04:39] (PS5) Tim Landscheidt: Tools: Puppetize gridengine complex configuration [puppet] - https://gerrit.wikimedia.org/r/230376 (https://phabricator.wikimedia.org/T107821)
[21:14:55] (CR) Tim Landscheidt: "modules/gridengine/files/complex-99-default was identical to the factory defaults. On Toolsbeta, after resetting to the factory defaults " [puppet] - https://gerrit.wikimedia.org/r/230376 (https://phabricator.wikimedia.org/T107821) (owner: Tim Landscheidt)
[21:31:47] (PS7) Wpmirrordev: Extend maximum allowed mediawiki version to 1.26 [dumps] (ariel) - https://gerrit.wikimedia.org/r/171976 (https://phabricator.wikimedia.org/T68661)
[22:32:34] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (113366s 100000s)
[22:40:52] apergos: ping
[22:43:55] hi mafk
[22:44:08] hi Krenair
[22:48:42] mafk, did you see https://phabricator.wikimedia.org/T93562 ?
[22:49:44] Krenair: yes, I'm thinking if I should create a new address.
[22:55:03] PROBLEM - BGP status on cr2-eqiad is CRITICAL host 208.80.154.197, sessions up: 73, down: 3, shutdown: 0; Peering with AS1273 not established - CW; Peering with AS8218 not established - NEO-ASN; Peering with AS62651 not established - BR
[23:25:55] (PS2) BBlack: Release 1.9.3-1+wmf2 (newer multicert patches) [software/nginx] (wmf-1.9.3-1) - https://gerrit.wikimedia.org/r/230469
[23:25:57] (PS1) BBlack: Update multi-cert patches to Apr 27 version from nginx-devel list [software/nginx] (wmf-1.9.3-1) - https://gerrit.wikimedia.org/r/230472
[23:25:59] (PS1) BBlack: Add multicert changes to the new stream modules [software/nginx] (wmf-1.9.3-1) - https://gerrit.wikimedia.org/r/230473
[23:43:24] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[23:47:28] (PS3) BBlack: HTTP/2 alpha patch [software/nginx] (wmf-1.9.3-1-h2) - https://gerrit.wikimedia.org/r/230040
[23:57:13] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]