[00:00:05] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151224T0000).
[00:00:05] yurik: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:33] Go away jouncebot. You're drunk.
[00:01:14] jouncebot: reload
[00:11:32] operations, Fundraising-Backlog, Traffic, Unplanned-Sprint-Work, Fundraising Sprint Zapp: Firefox SPDY is buggy and is causing geoip lookup errors for IPv6 users - https://phabricator.wikimedia.org/T121922#1902801 (Ejegg) Some stats for a full day post-fallback-fix: 15,539,986 hits to geoiploo...
[00:25:01] PROBLEM - puppet last run on mw2193 is CRITICAL: CRITICAL: puppet fail
[00:26:46] thx )
[00:27:02] Is ")" just the smile part of the face?
[00:27:02] Leah, ?
[00:27:12] ah, yes
[00:27:21] it's a shorthand for a smiley ))
[00:27:24] :-)
[00:32:09] PROBLEM - puppet last run on wtp2019 is CRITICAL: CRITICAL: puppet fail
[00:38:44] !log restbase1003: starting `nodetool cleanup`
[00:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:53:58] RECOVERY - puppet last run on mw2193 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:58:17] RECOVERY - puppet last run on wtp2019 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[01:06:17] Puppet, operations, Continuous-Integration-Infrastructure, Labs: compiler02.puppet3-diffs.eqiad.wmflabs out of disk space - https://phabricator.wikimedia.org/T122346#1902907 (scfc) >>! In T122346#1902019, @Dzahn wrote: > I checked /mnt/jenkins-workspace/puppet-compiler/output# for especially large o...
[02:23:48] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 10m 03s)
[02:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:30:40] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Dec 24 02:30:40 UTC 2015 (duration 6m 52s)
[02:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:59:35] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: puppet fail
[03:20:47] PROBLEM - puppet last run on mw2188 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:27:07] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[03:45:48] RECOVERY - puppet last run on mw2188 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:08:13] (PS1) Yuvipanda: apt: Do not use a proxy in labs [puppet] - https://gerrit.wikimedia.org/r/260897
[04:08:52] (PS2) Yuvipanda: apt: Do not use a proxy in labs [puppet] - https://gerrit.wikimedia.org/r/260897
[04:11:26] PROBLEM - puppet last run on mw2063 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:11:38] (CR) Yuvipanda: [C: +2] "Puppetcompiler says noop." [puppet] - https://gerrit.wikimedia.org/r/260897 (owner: Yuvipanda)
[04:16:37] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 26.92% of data above the critical threshold [100000000.0]
[04:36:43] RECOVERY - puppet last run on mw2063 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[04:43:51] (PS1) Yuvipanda: apt: Followup to I801f0e007aac39421 [puppet] - https://gerrit.wikimedia.org/r/260898
[04:44:08] (CR) Yuvipanda: [C: +2 V: +2] apt: Followup to I801f0e007aac39421 [puppet] - https://gerrit.wikimedia.org/r/260898 (owner: Yuvipanda)
[04:46:02] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[06:04:51] PROBLEM - puppet last run on db2035 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:08:02] PROBLEM - puppet last run on mw2125 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:12] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:22] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail
[06:31:59] RECOVERY - puppet last run on db2035 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:32:28] RECOVERY - puppet last run on mw2125 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:32:50] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:09] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:40] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:58] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:59] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:29] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:56:39] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:56:50] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:57:09] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:57:39] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:19] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:39] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:36] * YuviPanda cradles icinga-wm for comfort
[06:59:37] nice chat!
[07:00:27] nightly puppet failure alerts: still demoralizing, after all these years
[07:04:55] PROBLEM - Disk space on restbase1003 is CRITICAL: DISK CRITICAL - free space: /var 110090 MB (3% inode=99%)
[07:12:31] ori: interesting.
[07:12:38] ori: it turns out apache's logs are actually only rotated weekly
[07:12:39] not daily
[07:12:59] oh
[07:13:01] no
[07:13:03] it's set to daily
[07:13:05] hmmm
[07:13:11] overridden for palladium
[07:14:23] yeah
[07:14:36] there's plenty of disk space
[07:15:32] what'll happen if I flip it to weekly?
[07:15:53] I think that on some deep philosophical level, it will be worse
[07:15:58] because that is truly admitting defeat, no?
[07:16:38] is there defeat left to admit?
[07:16:53] well, right now it's "we won't fix it"
[07:17:04] but that is tantamount to saying that we can't, either
[07:17:28] it's mostly just me and maybe you active when it hits these days.
[07:17:47] Giuseppe too when he's not on vacation
[07:17:52] true
[07:17:57] but anyways:
[07:18:27] - log file lookup by date becomes more annoying spelunking
[07:18:35] s/spelunking//
[07:19:11] - CPU spike is less frequent but more severe
[07:19:46] - bigger risk of filling up the disk
[07:20:06] - doesn't actually solve the problem
[07:20:38] - i don't have any more bullet points
[07:21:02] * YuviPanda sends an NRA-affiliated salesman to ori
[07:21:22] on the positive side, it'll be 1/7th as depressing!
[07:21:52] I wonder what graceful-stop does to apache
[07:22:53] ori: so right now apache does a reload. I wonder if a graceful will 'fix' our problems
[07:23:50] or maybe I should just embrace the depression and throw out my subway sandwich and go home from the damn office.
[07:23:55] hmm
[07:25:48] they're the same
[07:26:04] the case statement in /etc/init.d/apache2 is "reload | force-reload | graceful)"
[07:26:15] i.e., they are synonymous
[07:26:20] haha
[07:26:22] that's fun
[07:27:19] all of which send apache2 SIGUSR1
[07:27:50] which is already the "nicest" signal https://httpd.apache.org/docs/2.4/en/stopping.html#graceful
[07:28:07] heh
[07:28:14] maybe
[07:28:27] we can make puppet-run *not* run for 10mins on either side of this time
[07:28:53] that sounds even more defeated, actually.
[07:28:58] almost like... fear, even.
[07:29:36] RECOVERY - Disk space on restbase1003 is OK: DISK OK
[07:31:35] I really should go home, I guess.
[07:47:47] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: puppet fail
[08:15:17] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[08:18:09] (PS6) Mdann52: Tidy robots.txt [mediawiki-config] - https://gerrit.wikimedia.org/r/240065 (https://phabricator.wikimedia.org/T104251)
[08:37:19] PROBLEM - puppet last run on mw2066 is CRITICAL: CRITICAL: puppet fail
[08:54:27] operations, Performance-Team: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1903140 (aaron) I'd assume you'd also want to exclude refreshLinks.
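The exchange above pins down why switching the logrotate hook from reload to graceful would change nothing: the init script maps all three verbs onto one branch, and that branch sends SIGUSR1, which the linked Apache docs already describe as the gentlest option. A minimal sketch of the pattern ori quotes — this is illustrative, not the literal Debian /etc/init.d/apache2, and the pid-file path is an assumption:

    # Sketch of the case statement quoted from /etc/init.d/apache2 above.
    # Not the verbatim Debian script; the pid-file path is an assumption.
    case "$1" in
        reload | force-reload | graceful)
            # All three verbs share one branch, so they are synonymous:
            # SIGUSR1 tells apache2 to finish in-flight requests, re-read
            # its config, and reopen its log files.
            kill -USR1 "$(cat /var/run/apache2/apache2.pid)"
            ;;
    esac

Since the rotation hook already ends in the mildest signal Apache supports, the nightly CPU spike has to be addressed elsewhere, which is what the rest of the exchange circles around.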
[09:06:03] RECOVERY - puppet last run on mw2066 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[09:10:13] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 652
[09:10:14] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 652
[09:20:13] RECOVERY - check_mysql on lutetium is OK: Uptime: 589246 Threads: 1 Questions: 31757497 Slow queries: 5818 Opens: 49916 Flush tables: 2 Open tables: 64 Queries per second avg: 53.895 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:20:13] RECOVERY - check_mysql on db1008 is OK: Uptime: 233001 Threads: 108 Questions: 11986876 Slow queries: 2979 Opens: 12069 Flush tables: 2 Open tables: 409 Queries per second avg: 51.445 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:23:13] (CR) Alexandros Kosiaris: [C: -1] "we got systems with public IPs that we want to use those as well for consistency reasons" [puppet] - https://gerrit.wikimedia.org/r/260872 (owner: Yuvipanda)
[09:26:04] Ops-Access-Requests, operations, Analytics-Backlog: add mforns, milimetric, nuria, ottomata, madhuvishy and joal to piwik-roots - https://phabricator.wikimedia.org/T122325#1903144 (akosiaris) >>! In T122325#1902508, @Nuria wrote: > We will need to be able to see logs, do queries and start an stop the w...
[09:44:19] operations, Performance-Team, Thumbor, Patch-For-Review: Use cgroups to limit thumbor & subprocesses resource usage - https://phabricator.wikimedia.org/T120940#1903149 (Gilles) Open>Resolved
[09:46:15] operations, Services, Graphite, Icinga, Monitoring: various graphite based monitoring checks broken (memcached, parsoid, restbase, eventlogging..) - https://phabricator.wikimedia.org/T122332#1903156 (fgiunchedi)
[09:46:43] operations, Services, Graphite, Icinga, Monitoring: various graphite based monitoring checks broken (memcached, parsoid, restbase, eventlogging..) - https://phabricator.wikimedia.org/T122332#1901681 (fgiunchedi) merged with {T105218} as they seem the same
[10:00:51] operations, HTTPS: ssl certificate replacement: tendril.wikimedia.org (expires 2016-02-15) - https://phabricator.wikimedia.org/T122319#1903165 (jcrespo) a: jcrespo>RobH Downtime is not important, assuming it is only for a few minutes, as there is not any ongoing issue. Also doing within at your normal...
[10:03:35] operations, RESTBase-Cassandra: Update to Cassandra 2.1.12 - https://phabricator.wikimedia.org/T120803#1903167 (JAllemandou) I agree bottleneck will be on IOs, and might come soon depending on expected response times. Also, read consistency of one could indeed be set to one. Thanks @Gwicke, @fgiunchedi...
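The SLOW_SLAVE alerts above fire when Seconds Behind Master climbs past a threshold and clear once the replica catches back up to zero. A rough sketch of what such a check boils down to — this is not the actual check_mysql plugin; the host name, credentials handling, and 600-second threshold are assumptions for illustration:

    # Approximates what the SLOW_SLAVE check measures -- not the real
    # check_mysql plugin. Host and threshold below are assumptions.
    LAG=$(mysql -h db1008 -e 'SHOW SLAVE STATUS\G' |
          awk '/Seconds_Behind_Master/ {print $2}')
    if [ -z "$LAG" ] || [ "$LAG" = "NULL" ]; then
        echo "CRITICAL: replication not running"; exit 2
    elif [ "$LAG" -gt 600 ]; then
        echo "SLOW_SLAVE CRITICAL: Seconds Behind Master: $LAG"; exit 2
    fi
    echo "OK: Seconds Behind Master: $LAG"; exit 0

The 09:10 alert (652s behind) and the 09:20 recovery (0s) above are exactly this state machine flipping between the CRITICAL (exit 2) and OK (exit 0) plugin exit codes.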
[10:22:24] (PS2) Mobrovac: EventBus: add spec-based monitoring [puppet] - https://gerrit.wikimedia.org/r/260799
[10:33:23] !log restarting 's2' replication on dbstore200[12] after cloning
[10:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:38:10] (CR) Mobrovac: "https://puppet-compiler.wmflabs.org/1550/ says life's good for kafka1001" [puppet] - https://gerrit.wikimedia.org/r/260799 (owner: Mobrovac)
[10:40:13] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 774
[10:50:13] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 689
[10:55:13] RECOVERY - check_mysql on db1008 is OK: Uptime: 238701 Threads: 112 Questions: 12100117 Slow queries: 3153 Opens: 12672 Flush tables: 2 Open tables: 411 Queries per second avg: 50.691 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2
[11:07:31] (PS1) Filippo Giunchedi: puppetmaster: add facts export script [puppet] - https://gerrit.wikimedia.org/r/260910
[11:09:42] (Abandoned) Yuvipanda: Restrict url downloader and proxy to $INTERNAL only [puppet] - https://gerrit.wikimedia.org/r/260872 (owner: Yuvipanda)
[11:28:47] !log restarting and reconfiguring mysql at db2044
[11:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:31:19] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 58 failures
[11:40:01] PROBLEM - puppet last run on mw2060 is CRITICAL: CRITICAL: puppet fail
[11:50:44] !log restarting and reconfiguring mysql at db2051
[11:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:02:26] PROBLEM - RAID on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:03:35] RECOVERY - RAID on mw1008 is OK: OK: no RAID installed
[12:04:07] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[12:08:06] RECOVERY - puppet last run on mw2060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:10:38] (CR) Alexandros Kosiaris: CX: Use config.yaml to read registry (1 comment) [puppet] - https://gerrit.wikimedia.org/r/260575 (owner: KartikMistry)
[12:12:41] !log restart and reconfiguring mysql for db2058
[12:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:19:27] operations, Traffic, netops: Orange S.A. searches a contact at WMF for tests - https://phabricator.wikimedia.org/T122293#1903240 (faidon) Correct, this is a #netops issue. For reference, the canonical address for contacting us about such issues is noc@ (which is an industry-standard one). Up-to-date...
[12:19:36] operations, netops: Orange S.A. searches a contact at WMF for tests - https://phabricator.wikimedia.org/T122293#1903241 (faidon)
[12:19:44] operations, netops: Orange S.A. searches a contact at WMF for tests - https://phabricator.wikimedia.org/T122293#1903242 (faidon) p: Triage>Normal
[12:32:15] !log restart and reconfigure mysql at db2065
[12:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:06:20] (PS1) Faidon Liambotis: apt: move the absents outside the $use_proxy guard [puppet] - https://gerrit.wikimedia.org/r/260917
[13:12:28] (CR) KartikMistry: CX: Use config.yaml to read registry (1 comment) [puppet] - https://gerrit.wikimedia.org/r/260575 (owner: KartikMistry)
[13:12:57] (CR) Faidon Liambotis: [C: +2] apt: move the absents outside the $use_proxy guard [puppet] - https://gerrit.wikimedia.org/r/260917 (owner: Faidon Liambotis)
[13:42:22] PROBLEM - dhclient process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:42:32] PROBLEM - puppet last run on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:42:41] PROBLEM - RAID on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:42:42] PROBLEM - configured eth on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:42:52] PROBLEM - nutcracker port on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:43:21] PROBLEM - salt-minion processes on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:43:41] PROBLEM - SSH on mw1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:43:51] PROBLEM - nutcracker process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:43:51] PROBLEM - DPKG on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:44:03] PROBLEM - Disk space on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:48:31] PROBLEM - puppet last run on mc2016 is CRITICAL: CRITICAL: puppet fail
[13:51:52] PROBLEM - puppet last run on mw2070 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:57:22] RECOVERY - SSH on mw1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[13:57:52] RECOVERY - Disk space on mw1012 is OK: DISK OK
[13:58:11] RECOVERY - dhclient process on mw1012 is OK: PROCS OK: 0 processes with command name dhclient
[13:58:32] RECOVERY - nutcracker port on mw1012 is OK: TCP OK - 0.000 second response time on port 11212
[13:58:32] RECOVERY - configured eth on mw1012 is OK: OK - interfaces up
[13:59:02] RECOVERY - salt-minion processes on mw1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:59:41] RECOVERY - nutcracker process on mw1012 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[13:59:41] RECOVERY - DPKG on mw1012 is OK: All packages OK
[14:00:31] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 24 minutes ago with 0 failures
[14:00:31] RECOVERY - RAID on mw1012 is OK: OK: no RAID installed
[14:00:44] operations, netops: Orange S.A. searches a contact at WMF for tests - https://phabricator.wikimedia.org/T122293#1903372 (Trizek-WMF) Thank you @faidon. I let @Sylvain_WMFr deal with it in January.
[14:02:29] PROBLEM - dhclient process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:04:39] PROBLEM - puppet last run on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:04:49] PROBLEM - RAID on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:05:09] PROBLEM - configured eth on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:05:38] PROBLEM - Disk space on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:05:59] PROBLEM - SSH on mw1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:06:29] PROBLEM - nutcracker process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:06:43] !log restart and reconfigure mysql at db2038
[14:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:06:58] PROBLEM - salt-minion processes on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:09:16] !log rolling restart of hhvm jobrunners (T122069)
[14:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:10:59] PROBLEM - nutcracker port on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:11:09] PROBLEM - DPKG on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:11:50] !log powercycling mw1012, OOM'ed/stuck
[14:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:14:19] RECOVERY - RAID on mw1012 is OK: OK: no RAID installed
[14:14:29] RECOVERY - salt-minion processes on mw1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:14:39] RECOVERY - configured eth on mw1012 is OK: OK - interfaces up
[14:14:39] RECOVERY - nutcracker port on mw1012 is OK: TCP OK - 0.000 second response time on port 11212
[14:14:50] RECOVERY - DPKG on mw1012 is OK: All packages OK
[14:14:59] RECOVERY - Disk space on mw1012 is OK: DISK OK
[14:15:29] RECOVERY - dhclient process on mw1012 is OK: PROCS OK: 0 processes with command name dhclient
[14:15:30] RECOVERY - SSH on mw1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[14:15:59] RECOVERY - nutcracker process on mw1012 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[14:16:08] RECOVERY - puppet last run on mc2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:16:08] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 39 minutes ago with 0 failures
[14:19:19] RECOVERY - puppet last run on mw2070 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:29:29] (PS1) Faidon Liambotis: network: move esams/ulsfo subnets below codfw [puppet] - https://gerrit.wikimedia.org/r/260922
[14:29:31] (PS1) Faidon Liambotis: network: move frack networks into a separate realm [puppet] - https://gerrit.wikimedia.org/r/260923
[14:29:34] (PS1) Faidon Liambotis: network: split frack into its proper subnets [puppet] - https://gerrit.wikimedia.org/r/260924
[14:29:35] (PS1) Faidon Liambotis: network: add sandbox "realm" [puppet] - https://gerrit.wikimedia.org/r/260925
[14:29:38] (PS1) Faidon Liambotis: network: add $production_networks [puppet] - https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396)
[14:42:09] (CR) Jgreen: [C: -1] "minor fix" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/260924 (owner: Faidon Liambotis)
[15:02:26] !log restarting and reconfiguring mysql at db2045
[15:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:03:36] operations, Labs, netops: Consider renumbering Labs to separate address spaces - https://phabricator.wikimedia.org/T122406#1903452 (faidon) NEW
[15:19:48] Is a phabricator admin here? I need someone with console access to phabricator
[15:20:59] Ok, problem solved automatically
[15:21:20] the task daemon stopped, but now it is ok
[15:21:53] hm, seems like it stopped again
[15:31:59] UBN: https://phabricator.wikimedia.org/T122408
[15:49:58] Luke081515, what noticeable effects is that causing?
[15:50:17] for phabricator users, I mean
[15:56:58] jynus: This daemon is responsible for importing the repos, so maybe not all are actually imported. (I don't know more, because I can't see the logs)
[15:58:08] I would not classify that as an unbreak now - even if that were true, repos can surely be resynced easily
[15:59:10] I am not saying it shouldn't be researched, but as you may already have seen, there are some already known issues with our phabricator
[16:02:23] I would mention the issue on T112776, and leave it untriaged so it is seen by the appropriate people
[16:11:31] !log restart and mysql reconfiguration of db2052
[16:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:16:21] ACKNOWLEDGEMENT - HTTPS on magnesium is CRITICAL: SSL CRITICAL - Certificate rt.wikimedia.org valid until 2016-01-09 09:48:57 +0000 (expires in 15 days) daniel_zahn migration to phab in progress - then it will be moved to misc-web
[16:22:58] what's "aqm_mem" on silicon? it's just slightly above warning level. WARNING MemoryPercentUsage 32 > 30 , StorePercentUsage 38 > 30
[16:23:17] amq_store, amq_mem
[16:24:08] ah, Apache ActiveMQ and FR it looks
[16:24:12] nice work with the acks, mutante
[16:24:33] jynus: oh, did you see the ones for warnings in the app?
[16:24:41] ?
[16:25:02] I see icinga ping, which is good
[16:25:06] *pink
[16:25:17] ah ;)
[16:25:23] so the ACK that icinga-wm talks about above is an ACK for a CRITICAL
[16:25:51] i also ACKed some that are not CRIT but just WARN level (for now)
[16:26:07] but icinga-wm filters the warnings to not be too spammy
[16:26:31] and i know you used an android app, so i thought that might have told you
[16:28:56] operations, Fundraising-Backlog, Traffic, Unplanned-Sprint-Work, Fundraising Sprint Zapp: Firefox SPDY is buggy and is causing geoip lookup errors for IPv6 users - https://phabricator.wikimedia.org/T121922#1903527 (faidon) Just to complete the numbers above: only 2 out of 743 1:1000 sampled req...
[16:33:11] eh.. what "Failed to parse inline template: undefined method `%' for nil:NilClass at /mnt/jenkins-workspace/puppet-compiler/1551/change/src/manifests/role/backup.pp:"
[16:35:38] yeah, someone mentioned a problem with the compiler, but sadly I wasn't in the loop, maybe the backscroll has more info than I do
[16:37:11] mutante: hiera issue, where $uniqueid/@uniqueid is not defined? The only obvious % is the one in @days[[@uniqueid].pack("H*").unpack("L")[0] % 7]
[16:37:55] hmm, thanks both of you..
[16:38:10] i guess it's not related to the "out of disk space" issue yesterday
[16:41:38] (CR) Alexandros Kosiaris: "Yeah, https://gerrit.wikimedia.org/r/260918 LGTM. So this patchset needs some update and to be deployed after https://gerrit.wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/260575 (owner: KartikMistry)
[16:45:50] !log restart and reconfigure mysql at db2059
[16:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:49:24] reads "Refreshing facts on the puppet compiler is easy!"... then sees the script for that: "find yaml .. | xargs -n1 perl -i"" -pe 's/^(\s*uniqueid:).*$/$1 "400a1000"/; s/^(\s*boardserialnumber:).*$/$1 "..CN123456AB00AA."/; s/^(\s*serialnumber:).*$/$1 AB12BP0/'" :))
[16:50:40] i'm trying the instructions from joe i found after reading backlog
[16:50:52] https://wikitech.wikimedia.org/wiki/User:Giuseppe_Lavagetto/RefreshPuppetCompiler
[16:51:09] mutante: incidentally I've posted https://gerrit.wikimedia.org/r/#/c/260910/ today
[16:51:48] godog: awesome :) i just thought about doing something like that when i saw it on gist
[16:57:27] (CR) Dzahn: [C: +1] "tested this on palladium (right after joe's original script) and it works fine. thank you" [puppet] - https://gerrit.wikimedia.org/r/260910 (owner: Filippo Giunchedi)
[16:57:30] godog: tested
[16:58:46] mutante: sweet, thanks, yeah I updated the facts earlier today too
[17:02:19] hmmm, i wonder why i'd have to do it again for this change actually, i did it anyways
[17:02:31] (if you already did it earlier)
[17:05:19] yea, so it looks like the error is unrelated :p
[17:05:58] it's still there like before. but i learned how to refresh it, that's something
[17:09:39] (PS3) Dzahn: bugzilla-static: move role to modules/role/ [puppet] - https://gerrit.wikimedia.org/r/260606
[17:12:34] (CR) Dzahn: [C: +2] bugzilla-static: move role to modules/role/ [puppet] - https://gerrit.wikimedia.org/r/260606 (owner: Dzahn)
[17:20:12] !log restarting and reconfiguring mysql at db2066
[17:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:31:16] !log aqs: tweaked table properties for local_group_default_T_pageviews_per_article_flat: 2 months max DTCS window size, deflate compression
[17:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:33:08] (PS1) Dzahn: osm: split and move roles to modules/role/ [puppet] - https://gerrit.wikimedia.org/r/260936
[17:45:11] (PS1) Dzahn: beta: move roles to modules/role/ [puppet] - https://gerrit.wikimedia.org/r/260937
[18:12:31] (PS1) Dzahn: ci: split and move role classes to modules/role/ [puppet] - https://gerrit.wikimedia.org/r/260939
[18:18:48] (PS1) Dzahn: dataset: move roles to modules/role/ [puppet] - https://gerrit.wikimedia.org/r/260940
[18:22:00] operations: mw1228 reporting readonly file system - https://phabricator.wikimedia.org/T122005#1903579 (RobH)
[18:29:18] (PS1) Dzahn: graphite: move roles to modules/role/ [puppet] - https://gerrit.wikimedia.org/r/260941
[18:32:25] robh: re: mw1228
[18:32:27] ssh mw1228.eqiad.wmnet
[18:32:27] packet_write_wait: Connection to UNKNOWN: Broken pipe
[18:32:37] can't even login..
[18:33:14] eqiad tag to have hardware checked?
[18:34:20] mw1228 login: root
[18:34:20] [34428496.513327] end_request: I/O error, dev sda, sector 197411136
[18:34:43] operations, ops-eqiad: mw1228 reporting readonly file system - https://phabricator.wikimedia.org/T122005#1903587 (Dzahn)
[18:35:25] operations, ops-eqiad: mw1228 reporting readonly file system - https://phabricator.wikimedia.org/T122005#1893636 (Dzahn) when trying to ssh to it: packet_write_wait: Connection to UNKNOWN: Broken pipe when trying console login: mw1228 login: root [34428496.513327] end_request: I/O error, dev sda, secto...
[18:38:58] mutante: someone needs to reboot it remotely and fsck before we drop an onsite
[18:39:01] that's why i didn't yet
[18:39:03] imo
[18:39:15] i plan to shortly
[18:39:30] i triaged off the patchset waiting cuz there isn't one. all the patchsets listed were merged.
[18:39:54] but if you do that then yea
[18:40:02] robh: it looks pretty much like sda is dead
[18:40:03] tag it with ops-eqiad and set to no owner so chris takes care of it
[18:40:20] yea, it typically is, but we have dispatched in the past (others have, not you) without checking
[18:40:24] so thanks for checking!
[18:40:42] oh
[18:40:47] mutante: wait, ssh is the only test you did?
[18:40:48] alright, yep, i added the ops-eqiad but no owner
[18:41:04] oh, i see the task, disregard question
[18:41:08] no, also mgmt
[18:41:18] yep, the dev sda errors are a pretty clear indicator
[18:41:19] robh: that I/O error is failed disk typically but no mgmt is odd
[18:41:32] cmjohnson1: mgmt worked
[18:41:36] i get a console
[18:41:39] mutante: had it pulled up via mgmt to see the errors
[18:41:40] but when i type "root" to login
[18:41:46] it fills the screen with I/O errors
[18:41:50] ah.. okay then yah failed disk
[18:41:51] heh, nice
[18:42:02] thx for checking before tasking to onsite though, it's much appreciated
[18:42:09] yw
[18:42:46] granted, chris can do all the sw checking of course, but he doesn't scale like the team does, heh
[18:56:44] operations, HTTPS: ssl certificate replacement: tendril.wikimedia.org (expires 2016-02-15) - https://phabricator.wikimedia.org/T122319#1903599 (RobH) Awesome. I don't think it's a great idea to push an ssl change the day before a holiday+weekend, so I'll keep this stalled on me and implement when I return...
[19:03:17] operations, ops-codfw: rack/setup pc2004-2006 - https://phabricator.wikimedia.org/T121879#1903602 (Papaul)
[19:22:32] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:38:10] (PS1) Papaul: Add mgmt DNS entries for pc200[4-6] Add production DNS entries for 200[4-6] Bug:T121879 [dns] - https://gerrit.wikimedia.org/r/260942 (https://phabricator.wikimedia.org/T121879)
[19:48:53] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:29:04] operations, ops-codfw: rack/setup pc2004-2006 - https://phabricator.wikimedia.org/T121879#1903671 (Papaul)
[21:32:05] operations, ops-codfw: rack/setup pc2004-2006 - https://phabricator.wikimedia.org/T121879#1903672 (Papaul) pc2004 row B rack 5 10.193.2.231 ge-5/0/35 pc2005 row C rack 5 10.193.2.231 ge-5/0/3 pc2006 row D rack 5 10.193.2.231 ge-5/0/6
[21:32:31] operations, ops-codfw: rack/setup pc2004-2006 - https://phabricator.wikimedia.org/T121879#1903673 (Papaul)
[21:38:25] greg-g: What do you think about my doing a little Krampusnacht CentralNotice deployment?
[21:38:52] Looks like the deployment schedule has been curtailed this week, to keep everyone's remaining sanity...
[21:44:18] PROBLEM - puppet last run on mw1114 is CRITICAL: CRITICAL: Puppet has 68 failures
[21:45:10] greg-g: nvm, I see the note at the top of the section now: Thursday is the new Friday. 10-4!
[22:57:05] operations, ops-codfw, fundraising-tech-ops: bellatrix hardware RAID predictive failure - https://phabricator.wikimedia.org/T122026#1903805 (RobH) p: Triage>High
[23:09:53] PROBLEM - HHVM rendering on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:10:14] PROBLEM - Apache HTTP on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:10:42] PROBLEM - RAID on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:11:14] PROBLEM - configured eth on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:11:42] PROBLEM - dhclient process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:11:52] PROBLEM - Check size of conntrack table on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:12:02] PROBLEM - nutcracker port on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:12:03] PROBLEM - salt-minion processes on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:12:22] PROBLEM - DPKG on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:12:23] PROBLEM - nutcracker process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:12:52] PROBLEM - HHVM processes on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:12:53] PROBLEM - Disk space on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:13:02] PROBLEM - SSH on mw1114 is CRITICAL: Server answer
[23:26:14] !log mw1114 spammed all icinga errors, system is outputting endless scroll of login prompt, not halting for input (like another session or crash cart is sending it, or an error)
[23:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:26:44] and now it's not letting me in on serial.. that's annoying..
[23:27:10] !log i just reset drac on mw1114 because it said it was in use and i didn't see a log yet :p
[23:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:27:43] Why do you monitor so many services all for one host..
[23:27:59] robh: it was me, i thought it was the drac error
[23:28:19] ok
[23:28:27] uh, i will stop making a ticket and let you handle ;]
[23:28:31] !log powercycled mw1114
[23:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:28:44] I was on it diagnosing an error though
[23:28:49] so it may be gone now
[23:28:58] i logged the error though so you can check.
[23:29:14] oh, sorry, i just saw garbled output on console
[23:29:16] that made me do it
[23:29:41] cool that you logged
[23:30:29] coming back normal
[23:30:43] RECOVERY - SSH on mw1114 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[23:30:53] RECOVERY - configured eth on mw1114 is OK: OK - interfaces up
[23:31:13] RECOVERY - dhclient process on mw1114 is OK: PROCS OK: 0 processes with command name dhclient
[23:31:18] SPF|Cloud: the monitoring is alright, but what we'd want is dependencies between them so they don't ALL talk about it when the entire host is down
[23:31:23] RECOVERY - Check size of conntrack table on mw1114 is OK: OK: nf_conntrack is 0 % full
[23:31:50] icinga can do it, but our puppetized icinga doesn't yet
[23:31:54] RECOVERY - DPKG on mw1114 is OK: All packages OK
[23:32:13] RECOVERY - Disk space on mw1114 is OK: DISK OK
[23:32:32] RECOVERY - HHVM processes on mw1114 is OK: PROCS OK: 25 processes with command name hhvm
[23:32:33] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 66855 bytes in 0.433 second response time
[23:32:55] yea, all the checks are there because any single one of them failing has likely led to apaches falling over in the past ;]
[23:33:03] RECOVERY - RAID on mw1114 is OK: OK: no RAID installed
[23:33:19] we had like 2 checks for each host a few years ago, which is not enough! (we still are adding more specialized checks for services)
[23:33:22] RECOVERY - nutcracker port on mw1114 is OK: TCP OK - 0.000 second response time on port 11212
[23:33:33] RECOVERY - nutcracker process on mw1114 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[23:33:41] but indeed, what mutante said would be nice, where it has a hierarchy and only pages for the highest alert within it
[23:33:52] it would also save our sanity during outage conditions
[23:34:04] http://docs.icinga.org/latest/en/dependencies.html
[23:34:12] RECOVERY - salt-minion processes on mw1114 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[23:34:55] "If all of the notification dependency tests for the service passed, Icinga will send notifications out for the service as it normally would. If even just one of the notification dependencies for a service fails, Icinga will temporarily repress notifications for that (dependent) service."
[23:35:14] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.257 second response time
[23:35:35] that stuff should make it possible.. if the host is down.. it would suppress notifications for the services on this host
[23:35:41] because then it's obvious they are all down
[23:35:43] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:40:40] sounds neat
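For the dependency idea mutante sketches above, the linked Icinga documentation uses servicedependency objects. A hand-written sketch of what one could look like, using the mw1114 checks from this incident as illustrative names; the directives are from the Icinga 1.x object syntax, and this is not the config the puppetized setup actually generated:

    # Sketch of a notification dependency per the Icinga docs linked above.
    # Host/service names are illustrative; not the generated production config.
    define servicedependency {
        host_name                       mw1114
        service_description             SSH               ; master check
        dependent_host_name             mw1114
        dependent_service_description   HHVM rendering    ; muted while master fails
        notification_failure_criteria   c,u               ; suppress while SSH is CRITICAL or UNKNOWN
    }

With dependencies like this in place, an mw1114-style outage would notify once for the master check instead of a dozen times for every NRPE check riding on the same host.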