[00:06:35] JohnLewis, I guess it's possible that shipping to eqiad was suggested, but solr3 went to codfw instead
[00:06:47] possible
[00:16:32] either way the thing about capella is still a bit strange
[00:17:43] at least we know the server is unused in a rack in codfw
[00:19:47] JohnLewis, clue
[00:20:10] https://wikitech.wikimedia.org/w/index.php?title=Server_Spares&diff=152418&oldid=152413
[00:20:37] PowerEdge R420
[00:20:51] https://wikitech.wikimedia.org/wiki/Mobile1
[00:20:57] Poweredge 1950
[00:21:15] not the same then :)
[00:21:32] plus the paper trail of capella -> solr3 -> Dell is helpful
[00:44:35] 6operations, 10Wikimedia-General-or-Unknown, 7database: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1201372 (10aaron) >>! In T26675#1152976, @Krenair wrote: > I suspect that if we had this blob laying around before, it may have been...
[01:15:16] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected
[01:54:31] (03CR) 10Dereckson: "Changes in CommonSettings.php and InitialiseSettings.php looks good to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203627 (https://phabricator.wikimedia.org/T15712) (owner: 10devunt)
[02:08:19] (03CR) 10Dereckson: [C: 04-1] "Now the security review is done, the first step before a live deployment is to deploy it on the beta cluster to see all works really fine:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203627 (https://phabricator.wikimedia.org/T15712) (owner: 10devunt)
[02:10:08] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL 21.43% of data above the critical threshold [100000000.0]
[02:21:18] !log l10nupdate Synchronized php-1.25wmf24/cache/l10n: (no message) (duration: 06m 23s)
[02:21:32] Logged the message, Master
[02:26:10] !log LocalisationUpdate completed (1.25wmf24) at 2015-04-12 02:25:07+00:00
[02:26:17] Logged the message, Master
[02:29:57] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0]
[02:37:37] RECOVERY - Outgoing network saturation on labstore1001 is OK Less than 10.00% above the threshold [75000000.0]
[02:41:52] !log l10nupdate Synchronized php-1.26wmf1/cache/l10n: (no message) (duration: 05m 44s)
[02:41:57] Logged the message, Master
[02:46:18] !log LocalisationUpdate completed (1.26wmf1) at 2015-04-12 02:45:15+00:00
[02:46:22] Logged the message, Master
[02:50:57] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[03:35:27] PROBLEM - puppet last run on mw1074 is CRITICAL Puppet has 1 failures
[03:35:47] PROBLEM - puppet last run on mw1163 is CRITICAL Puppet has 1 failures
[03:36:37] PROBLEM - puppet last run on mw1087 is CRITICAL Puppet has 1 failures
[03:41:56] PROBLEM - puppet last run on mw1093 is CRITICAL Puppet has 1 failures
[03:42:57] PROBLEM - puppet last run on mw2064 is CRITICAL Puppet has 1 failures
[03:48:07] PROBLEM - puppet last run on es2009 is CRITICAL puppet fail
[03:52:06] PROBLEM - puppet last run on mw1169 is CRITICAL Puppet has 1 failures
[03:57:58] RECOVERY - puppet last run on mw1074 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures
[03:58:27] RECOVERY - puppet last run on mw1163 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures
[03:59:06] RECOVERY - puppet last run on mw2064 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures
[03:59:16] RECOVERY - puppet last run on mw1087 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:59:36] RECOVERY - puppet last run on mw1093 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures
[04:00:07] RECOVERY - puppet last run on mw1169 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures
[04:05:57] RECOVERY - puppet last run on es2009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:29:17] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100%
[04:30:56] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 42.93 ms
[04:47:26] 6operations, 7database: Better backup coverage for X1 database cluster - https://phabricator.wikimedia.org/T95835#1201504 (10Mattflaschen) 3NEW
[04:49:35] 6operations, 7database: Better backup coverage for X1 database cluster - https://phabricator.wikimedia.org/T95835#1201513 (10Mattflaschen)
[05:28:48] anyone know how we manage clock drift in the apache cluster? if i'm processing some data, how much fudge should i put in for that when guessing at an order of operations?
[05:29:36] i can probably declare something with 10s timestamp difference to be ordered, but what about 5s? 2s? (the timestamps i have are ms)
[05:30:05] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Apr 12 05:29:02 UTC 2015 (duration 29m 1s)
[05:30:12] Logged the message, Master
[05:32:26] ebernhardson: hey! I’m looking and it looks like we run ntp on almost all our servers
[05:32:28] * YuviPanda|zzz checks again
[05:33:01] yeah, they’re on apaches
[05:33:04] now to see how it was
[05:34:31] ntp is a good sign, i think in general it keeps a cluster pretty darn close if run regularly
[05:35:09] ebernhardson: yes, and we have a daemon running
[05:35:32] ebernhardson: I’m going to run ‘date’ on all our apaches via salt and see how we’re doing
[05:36:23] YuviPanda|zzz: excellent, thanks
[05:36:38] ebernhardson: we use ntp and it's pretty good afaik...there is $wgClockSkewFudge
[05:37:00] ebernhardson: they seem in sync to me from a cursory look
[05:37:40] YuviPanda|zzz: thanks!
[05:38:19] yw
[05:41:00] cheesecat: wow, that is used in exactly one place. you really know your esoteric pieces of mediawiki :) thanks too
[05:41:21] * cheesecat is just a cat made of cheese
[05:41:58] hehe :)
[05:43:35] http://en.wikipedia.org/wiki/Chechil !
[05:56:27] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[05:58:17] cheesecat: is that what MaxSem had?!
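The ordering heuristic ebernhardson and YuviPanda|zzz converge on above reduces to one rule: with NTP keeping hosts within some bounded skew, only treat two events as ordered when their timestamps differ by more than that bound; anything closer is effectively concurrent. A minimal sketch, assuming millisecond timestamps and an illustrative 2-second fudge (not a value measured on the cluster):

```python
# Minimal sketch of the skew-fudge rule discussed above, assuming
# millisecond timestamps and an illustrative 2000 ms bound; a real
# bound should come from observed NTP offsets, not this constant.

FUDGE_MS = 2000

def compare_events(ts_a_ms: int, ts_b_ms: int, fudge_ms: int = FUDGE_MS) -> int:
    """Return -1 if a definitely happened first, 1 if b did,
    and 0 if the gap is within the skew fudge (order unknowable)."""
    delta = ts_b_ms - ts_a_ms
    if delta > fudge_ms:
        return -1
    if delta < -fudge_ms:
        return 1
    return 0

# 10 s apart: safely ordered. 500 ms apart: treat as concurrent.
assert compare_events(1428800000000, 1428800010000) == -1
assert compare_events(1428800000000, 1428800000500) == 0
```

$wgClockSkewFudge, which cheesecat mentions and which is used in exactly one place in MediaWiki, applies the same kind of fudge on the PHP side when comparing timestamps.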
[06:01:17] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60615 bytes in 3.966 second response time
[06:08:35] YuviPanda|zzz, yes:P
[06:08:40] :D
[06:30:07] PROBLEM - puppet last run on mw2066 is CRITICAL Puppet has 1 failures
[06:30:07] PROBLEM - puppet last run on cp3042 is CRITICAL Puppet has 2 failures
[06:31:16] PROBLEM - puppet last run on db2036 is CRITICAL Puppet has 1 failures
[06:31:37] PROBLEM - puppet last run on mw2114 is CRITICAL Puppet has 1 failures
[06:34:36] PROBLEM - puppet last run on mw2127 is CRITICAL Puppet has 4 failures
[06:34:57] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 1 failures
[06:35:27] PROBLEM - puppet last run on mw2184 is CRITICAL Puppet has 1 failures
[06:46:17] RECOVERY - puppet last run on mw2114 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:46:27] RECOVERY - puppet last run on mw2066 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:46:27] RECOVERY - puppet last run on cp3042 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:46:57] RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:47:37] RECOVERY - puppet last run on db2036 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:47:37] RECOVERY - puppet last run on mw2127 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:48:07] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:52:57] PROBLEM - puppet last run on mw2048 is CRITICAL puppet fail
[08:12:27] RECOVERY - puppet last run on mw2048 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:28:55] (03PS2) 10devunt: Add Josa extension and deploy to Korean language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203627 (https://phabricator.wikimedia.org/T15712)
[08:29:46] (03PS3) 10devunt: Add Josa extension and deploy to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203627 (https://phabricator.wikimedia.org/T15712)
[08:35:19] (03PS4) 10devunt: Add Josa extension and deploy to testwiki and labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203627 (https://phabricator.wikimedia.org/T15712)
[08:44:12] devunt, Chris Steipp mentioned coding conventions in his review, I don't see that it has been addressed
[08:45:36] I see there's https://gerrit.wikimedia.org/r/#/c/202754/1/Josa.class.php but that's not enough
[08:51:21] MaxSem, I checked all files with code-utils/stylize.php
[08:51:31] Is there something else that I have to clean?
[08:55:14] function names must be camelCase
[08:55:33] documentation
[08:55:44] utf8_to_unicode doesn't always return a result
[08:56:33] also, this function name is unclear because UTF-8 is Unicode
[09:01:36] PROBLEM - puppet last run on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:01:36] PROBLEM - RAID on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:01:46] PROBLEM - Hadoop DataNode on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:01:57] PROBLEM - SSH on analytics1017 is CRITICAL - Socket timeout after 10 seconds
[09:01:57] PROBLEM - salt-minion processes on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:02:17] PROBLEM - Disk space on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
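For context on the review exchange between MaxSem and devunt above: the Josa extension picks between Korean postposition pairs (은/는, 이/가, 을/를) based on whether the preceding syllable ends in a final consonant, which is why the code needs to get from UTF-8 bytes to code points at all. A sketch of the underlying rule, in Python rather than the extension's PHP, with illustrative function names that are not the extension's API:

```python
# Sketch of the rule behind the Josa extension. A precomposed Hangul
# syllable (U+AC00..U+D7A3) has a final consonant (batchim) iff
# (code point - 0xAC00) % 28 != 0; the choice between particle pairs
# like 은/는 hinges on that. Function names are illustrative only.

def has_batchim(syllable: str) -> bool:
    code = ord(syllable)
    if not 0xAC00 <= code <= 0xD7A3:
        raise ValueError("not a precomposed Hangul syllable")
    return (code - 0xAC00) % 28 != 0

def pick_josa(word: str, with_batchim: str, without_batchim: str) -> str:
    return with_batchim if has_batchim(word[-1]) else without_batchim

print(pick_josa("사람", "은", "는"))  # 사람은 (람 ends in ㅁ)
print(pick_josa("위키", "은", "는"))  # 위키는 (키 has no batchim)
```

Python 3 strings already expose code points directly, which sidesteps the decoding step that the utf8_to_unicode helper under review has to perform in PHP.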
[09:02:36] PROBLEM - DPKG on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:02:37] PROBLEM - dhclient process on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:02:37] PROBLEM - Hadoop NodeManager on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:02:47] PROBLEM - configured eth on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:31:51] 6operations: Ensure all disaster recover documentation is in one central location - https://phabricator.wikimedia.org/T95841#1201593 (10Peachey88) 3NEW
[09:35:16] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 609
[09:40:16] RECOVERY - check_mysql on db1008 is OK: Uptime: 2659848 Threads: 2 Questions: 16470797 Slow queries: 17712 Opens: 50615 Flush tables: 2 Open tables: 64 Queries per second avg: 6.192 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:40:48] RECOVERY - SSH on analytics1017 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[09:40:56] RECOVERY - salt-minion processes on analytics1017 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:41:07] RECOVERY - Disk space on analytics1017 is OK: DISK OK
[09:41:26] RECOVERY - DPKG on analytics1017 is OK: All packages OK
[09:41:27] RECOVERY - dhclient process on analytics1017 is OK: PROCS OK: 0 processes with command name dhclient
[09:41:27] RECOVERY - Hadoop NodeManager on analytics1017 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[09:41:37] RECOVERY - configured eth on analytics1017 is OK - interfaces up
[09:41:57] RECOVERY - puppet last run on analytics1017 is OK Puppet is currently enabled, last run 58 minutes ago with 0 failures
[09:41:58] RECOVERY - RAID on analytics1017 is OK no disks configured for RAID
[09:42:07] RECOVERY - Hadoop DataNode on analytics1017 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[10:34:13] (03PS1) 10Tim Landscheidt: gridengine: Puppetize gridengine-mailer [puppet] - 10https://gerrit.wikimedia.org/r/203656 (https://phabricator.wikimedia.org/T63160)
[11:01:02] (03CR) 10Tim Landscheidt: "Without this change:" [puppet] - 10https://gerrit.wikimedia.org/r/148917 (owner: 10Tim Landscheidt)
[11:31:34] 6operations, 10ops-fundraising: Ensure all disaster recover documentation is in one central location - https://phabricator.wikimedia.org/T95841#1201703 (10Krenair)
[11:31:55] 6operations, 10ops-fundraising: Ensure all disaster recover documentation is in one central location - https://phabricator.wikimedia.org/T95841#1201593 (10Krenair) Some of those need updating too...
[11:50:07] 6operations, 10ops-fundraising: Ensure all disaster recover documentation is in one central location - https://phabricator.wikimedia.org/T95841#1201733 (10Peachey88) >>! In T95841#1201703, @Krenair wrote: > Some of those need updating too... Subtasks, Yo! (or something)
[12:24:46] PROBLEM - puppet last run on mw2041 is CRITICAL Puppet has 1 failures
[12:36:08] PROBLEM - puppet last run on wtp2006 is CRITICAL puppet fail
[12:42:37] RECOVERY - puppet last run on mw2041 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:53:57] RECOVERY - puppet last run on wtp2006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:00:02] (03CR) 10Matanya: [C: 031] various role classes: moar small lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/202653 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn)
[14:06:55] (03PS2) 10Tim Landscheidt: Tools: Fix and simplify exim redirectors [puppet] - 10https://gerrit.wikimedia.org/r/148917
[14:09:12] (03CR) 10Tim Landscheidt: "No change (and still undeliverable = good) for T73692 addresses:" [puppet] - 10https://gerrit.wikimedia.org/r/148917 (owner: 10Tim Landscheidt)
[14:47:18] (03PS1) 10Andrew Bogott: Modify the scheduler filter to allow host aggregates (maybe) [puppet] - 10https://gerrit.wikimedia.org/r/203665
[14:47:53] !log Attached Helmut Welger@eowiki to the global account of the same name
[14:48:01] Logged the message, Master
[14:48:13] !log Attached Bradypus@enwiki and Bradypus@commonswiki to the global account of the same name
[14:48:17] Logged the message, Master
[14:49:53] (03CR) 10Andrew Bogott: [C: 032] Modify the scheduler filter to allow host aggregates (maybe) [puppet] - 10https://gerrit.wikimedia.org/r/203665 (owner: 10Andrew Bogott)
[14:54:03] !log Attached Peng@dewiktionary to the global account of the same name
[14:54:09] Logged the message, Master
[15:02:27] (03PS1) 10Andrew Bogott: Add labvirt1001 to the compute pool [puppet] - 10https://gerrit.wikimedia.org/r/203666
[15:06:49] (03PS1) 10Tim Landscheidt: Tools: Only forward mail for project users [puppet] - 10https://gerrit.wikimedia.org/r/203667 (https://phabricator.wikimedia.org/T93526)
[15:07:17] !log Attached Yagosaga@dewikibooks and Yagosaga@commonswiki to the global account of the same name
[15:07:21] Logged the message, Master
[15:12:59] (03CR) 10Tim Landscheidt: "Tested on Toolsbeta." [puppet] - 10https://gerrit.wikimedia.org/r/203667 (https://phabricator.wikimedia.org/T93526) (owner: 10Tim Landscheidt)
[15:14:50] !log Attached Srbauer@nowiki and Srbauer@sourceswiki to the global account of the same name
[15:14:54] Logged the message, Master
[15:22:16] !log Attached Aloiswuest@commonswiki, Aloiswuest@dewikiquote and Aloiswuest@dewiktionary to the global account of the same name
[15:22:20] Logged the message, Master
[15:32:36] PROBLEM - puppet last run on mw1133 is CRITICAL Puppet has 1 failures
[15:33:17] PROBLEM - puppet last run on mw1190 is CRITICAL Puppet has 1 failures
[15:39:56] !log Attached Manfred Strumpf@commonswiki to the global account of the same name
[15:40:01] Logged the message, Master
[15:49:28] RECOVERY - puppet last run on mw1190 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures
[15:50:18] RECOVERY - puppet last run on mw1133 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures
[16:32:57] 6operations, 10RESTBase, 7Monitoring, 5Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1201862 (10GWicke) List of graphite-based alerts in puppet: https://github.com/wikimedia/operations-puppet/search?utf8=%E2%9...
[16:43:37] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[16:46:47] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60655 bytes in 0.598 second response time
[16:51:46] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[16:54:28] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0]
[16:56:07] RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[16:58:07] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60634 bytes in 0.360 second response time
[17:10:56] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333
[17:12:58] 6operations, 6Labs: One instance hammering on NFS should not make it unavailable to everyone else - https://phabricator.wikimedia.org/T95766#1201894 (10JanZerebecki) At the time this happened I looked at https://grafana.wikimedia.org/#/dashboard/db/labs-monitoring and the inbound traffic on labstore1001 nearly...
[17:16:27] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0
[18:09:07] PROBLEM - puppet last run on mc2011 is CRITICAL puppet fail
[18:26:57] RECOVERY - puppet last run on mc2011 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures
[18:46:09] (03CR) 10Dereckson: [C: 04-1] Give patrol to reviewers for testwiki/enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199321 (https://phabricator.wikimedia.org/T93798) (owner: 10Cenarium)
[19:06:10] (03PS1) 10Yuvipanda: tools: time out webservice commands after 30s waiting for job [puppet] - 10https://gerrit.wikimedia.org/r/203682
[19:07:07] (03CR) 10Yuvipanda: "Note that my alternative to using this was to use signals.alarm, but invoking signal handlers for this seems like one of those things that" [puppet] - 10https://gerrit.wikimedia.org/r/203682 (owner: 10Yuvipanda)
[19:29:53] (03CR) 10Merlijn van Deen: "qsub should return the job ID; can't we just match using that? (I understand it's more work, though, so this can be an OK temporary fix)" [puppet] - 10https://gerrit.wikimedia.org/r/203682 (owner: 10Yuvipanda)
[19:30:58] (03CR) 10Yuvipanda: "We don't have the job id to begin with, so we can't use it everywhere, no?" [puppet] - 10https://gerrit.wikimedia.org/r/203682 (owner: 10Yuvipanda)
[19:36:11] (03CR) 10Merlijn van Deen: "Why not? We start the job in start_web_job, and qsub returns" [puppet] - 10https://gerrit.wikimedia.org/r/203682 (owner: 10Yuvipanda)
[19:37:07] (03CR) 10Tim Landscheidt: "Why don't we have the job number? As Merlijn wrote, it's returned by qsub when starting a web service. And if we're stopping, it's the j" [puppet] - 10https://gerrit.wikimedia.org/r/203682 (owner: 10Yuvipanda)
[19:39:15] (03CR) 10Tim Landscheidt: ""qsub -terse" is probably the best approach." [puppet] - 10https://gerrit.wikimedia.org/r/203682 (owner: 10Yuvipanda)
[19:41:00] (03CR) 10Yuvipanda: "Oops, you are all totally right. The job number will still be useless when stopping a webservice and for status, but those seem less likel" [puppet] - 10https://gerrit.wikimedia.org/r/203682 (owner: 10Yuvipanda)
[19:49:44] valhallasw`cloud: webservice2 is still hackier than I’d like, though
[19:49:52] we need a proper qsub / qstat abstraction
[19:50:33] (03CR) 10Tim Landscheidt: "It would block for example on "webservice stop", if the service monitor (what's it called again?) is faster in re-starting the web service" [puppet] - 10https://gerrit.wikimedia.org/r/203682 (owner: 10Yuvipanda)
[20:04:04] (03CR) 10Dereckson: [C: 04-1] "The goal of the change is to deploy only the extension to http://ko.wikipedia.beta.wmflabs.org, so we can check all is fine in a environme" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203627 (https://phabricator.wikimedia.org/T15712) (owner: 10devunt)
[20:09:20] (03PS2) 10Yuvipanda: tools: time out webservice commands after 30s waiting for job [puppet] - 10https://gerrit.wikimedia.org/r/203682
[20:11:04] (03CR) 10Merlijn van Deen: [C: 031] "This is an improvement over the status quo, even if it's not the perfect solution." [puppet] - 10https://gerrit.wikimedia.org/r/203682 (owner: 10Yuvipanda)
[20:19:17] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 21.43% of data above the critical threshold [500.0]
[20:38:46] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[20:44:07] PROBLEM - puppet last run on mw2183 is CRITICAL puppet fail
[21:03:27] RECOVERY - puppet last run on mw2183 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[23:06:21] (03CR) 10Cenarium: "I'm not sure what you want edited in the commit message." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199321 (https://phabricator.wikimedia.org/T93798) (owner: 10Cenarium)
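To make the qsub/qstat thread above concrete: Tim Landscheidt's "qsub -terse" suggestion hands the caller the job ID at start time, and the 30-second timeout in the patch amounts to polling with a deadline rather than installing a signal.alarm handler. A rough sketch of both pieces, assuming standard gridengine tooling; this is an illustration of the idea, not the actual webservice2 code:

```python
# Sketch of the approach converged on above: capture the job ID from
# "qsub -terse" at start, then poll qstat with a deadline instead of a
# signal.alarm handler. Illustrative code, not the webservice2 source.
import subprocess
import time

def start_web_job(job_script: str) -> str:
    # With -terse, gridengine's qsub prints only the new job ID.
    out = subprocess.check_output(["qsub", "-terse", job_script])
    return out.decode().strip()

def wait_for_job(job_id: str, timeout_s: float = 30.0) -> bool:
    """Poll "qstat -j <id>" until the job exists or the deadline passes."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        result = subprocess.run(["qstat", "-j", job_id],
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        if result.returncode == 0:  # qstat found the job
            return True
        time.sleep(1)
    return False  # caller reports a timeout instead of hanging forever
```

Polling keeps the failure mode simple: on timeout the command can report an error and exit, rather than hanging forever on a job the service monitor may already be restarting, which is the blocking case Tim Landscheidt flags above.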