[00:00:07] (03PS1) 10Dzahn: mysql_wmf, protoproxy: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211326 [00:00:34] (03PS2) 10Dzahn: mysql_wmf, protoproxy: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211326 [00:01:43] (03CR) 10Dzahn: [C: 032] puppet,puppet_compiler: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211322 (owner: 10Dzahn) [00:02:26] (03CR) 10Dzahn: [C: 032] kibana, labs_vmbuilder: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211323 (owner: 10Dzahn) [00:03:11] (03CR) 10Dzahn: [C: 032] logstash: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211324 (owner: 10Dzahn) [00:04:07] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [00:05:07] PROBLEM - SSH on labvirt1003 is CRITICAL - Socket timeout after 10 seconds [00:05:16] PROBLEM - RAID on labvirt1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:05:17] PROBLEM - puppet last run on labvirt1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:05:38] PROBLEM - configured eth on labvirt1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:08:09] uhm.. libvirt on labvirt being really busy ^ [00:21:12] (03PS1) 10Dzahn: salt, spamassassin: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211329 [00:25:39] (03Abandoned) 10Dzahn: lots of indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/204696 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [00:28:23] (03PS1) 10Dzahn: ganglia: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211333 [00:31:25] (03PS1) 10Dzahn: geoip: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211334 [00:33:02] (03PS13) 10BryanDavis: [WIP] Add role::mediawiki_vagrant_lxc [puppet] - 10https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T90892) [00:33:26] PROBLEM - NTP on labvirt1003 is CRITICAL: NTP CRITICAL: No response from NTP server [00:37:09] (03PS1) 10Dzahn: mediawiki_singlenode: rename defined type [puppet] - 10https://gerrit.wikimedia.org/r/211335 [00:44:49] (03PS1) 10Dzahn: contint: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211337 [00:48:40] (03PS1) 10Dzahn: monitoring, RT: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211339 [00:52:50] (03PS1) 10Dzahn: osm, rsync: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211341 [00:58:27] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 3.39% of data above the critical threshold [1000.0] [01:03:44] (03PS1) 10Dzahn: lvs: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211343 [01:04:31] (03CR) 10jenkins-bot: [V: 04-1] lvs: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211343 (owner: 10Dzahn) [01:08:58] (03PS2) 10Dzahn: lvs: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211343 [01:09:54] (03CR) 10jenkins-bot: [V: 04-1] lvs: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211343 (owner: 10Dzahn) [01:11:38] (03PS1) 10Dzahn: mirrors: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211345 [01:15:13] (03PS3) 10Dzahn: lvs: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211343 [01:20:07] (03PS2) 10Dzahn: mirrors: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211345 [01:20:21] (03PS1) 10Dzahn: labs_lvm: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211346 [01:22:41] (03PS1) 10Dzahn: dynamicproxy: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211347 [01:29:33] (03PS1) 10Dzahn: snapshot: 
lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211348 [01:32:04] (03PS2) 10Dzahn: snapshot: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211348 [01:34:58] (03PS1) 10Dzahn: ocg: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211349 [01:38:47] (03PS1) 10Dzahn: jenkins,package_builder,labs_bootstrapvz: lint [puppet] - 10https://gerrit.wikimedia.org/r/211350 [01:44:13] (03PS1) 10Dzahn: statistics: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211351 [01:48:42] (03PS2) 10Dzahn: statistics: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211351 [01:58:46] (03PS1) 10Dzahn: varnish: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211352 [02:00:42] (03PS1) 10Dzahn: gitblit: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211353 [02:04:47] (03PS1) 10Dzahn: quarry: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211354 [02:20:41] (03PS1) 10Dzahn: nrpe: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211355 [02:20:43] (03PS1) 10Dzahn: openstack: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211356 [02:22:02] (03PS1) 10Dzahn: mariadb: indentation fixes [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/211357 [02:22:59] !og es-tool restart-fast on elastic1029 [02:25:05] (03PS1) 10Dzahn: mysql: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211358 [02:25:07] !log l10nupdate Synchronized php-1.26wmf5/cache/l10n: (no message) (duration: 05m 55s) [02:25:23] Logged the message, Master [02:28:37] (03PS1) 10Dzahn: phabricator: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211359 [02:29:41] !log LocalisationUpdate completed (1.26wmf5) at 2015-05-16 02:28:37+00:00 [02:29:48] Logged the message, Master [02:29:57] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (91005s 90000s) [02:33:31] (03PS1) 10Dzahn: git: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211360 [02:35:07] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [02:39:37] PROBLEM - High load average on labstore1001 is CRITICAL 87.50% of data above the critical threshold [24.0] [02:43:19] !log l10nupdate Synchronized php-1.26wmf6/cache/l10n: (no message) (duration: 04m 55s) [02:43:29] Logged the message, Master [02:47:11] !log LocalisationUpdate completed (1.26wmf6) at 2015-05-16 02:46:08+00:00 [02:47:24] Logged the message, Master [02:47:37] PROBLEM - puppet last run on mw2108 is CRITICAL Puppet has 1 failures [02:55:06] PROBLEM - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1330 bytes in 0.254 second response time [02:58:43] !og es-tool restart-fast on elastic1030 [03:04:07] RECOVERY - puppet last run on mw2108 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [03:25:37] RECOVERY - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.244 second response time [03:53:46] mutante: still there? 
[04:11:47] !log restarting sshd and generally poking around on labvirt1003 [04:11:53] Logged the message, Master [04:13:28] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [04:17:07] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [04:18:22] 6operations, 10ops-eqiad, 6Labs: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290156 (10Andrew) 3NEW a:3Cmjohnson [04:19:40] 6operations, 10ops-eqiad, 6Labs: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290168 (10Andrew) There's a fair amount of other ugliness in dmesg, e.g. [1843134.114144] INFO: task gmond:61831 blocked for more than 120 seconds. [1843134.145729] Not tainted 3.13.0-49-generic #83-Ub... [04:20:34] 6operations, 10ops-eqiad, 6Labs: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290169 (10Andrew) I'm going to leave the system up for now, since we might as well minimize the labs outage. I can't imagine this isn't going to require a dc visit though :( [04:20:53] 6operations, 10ops-eqiad, 6Labs: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290170 (10Andrew) p:5Triage>3Unbreak! [04:25:07] PROBLEM - High load average on labstore1001 is CRITICAL 62.50% of data above the critical threshold [24.0] [04:30:07] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [04:33:37] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [04:36:51] 6operations, 10ops-eqiad, 6Labs: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290172 (10Andrew) Oh, btw, sshd and ganglia-monitor are comatose on that system for reasons that are unclear to me. The mgmt console is working fine. 
[04:41:38] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [04:52:37] PROBLEM - puppet last run on mw2125 is CRITICAL puppet fail [05:02:05] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat May 16 05:01:02 UTC 2015 (duration 1m 1s) [05:02:12] Logged the message, Master [05:04:46] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [05:08:26] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (18051 90000s) [05:12:18] RECOVERY - puppet last run on mw2125 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:26:38] PROBLEM - puppet last run on db1073 is CRITICAL Puppet has 1 failures [05:39:17] PROBLEM - High load average on labstore1001 is CRITICAL 75.00% of data above the critical threshold [24.0] [05:44:56] RECOVERY - puppet last run on db1073 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:50:57] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [06:01:07] PROBLEM - nova-compute process on labvirt1003 is CRITICAL: Connection refused by host [06:01:07] PROBLEM - DPKG on labvirt1003 is CRITICAL: Connection refused by host [06:01:36] PROBLEM - salt-minion processes on labvirt1003 is CRITICAL: Connection refused by host [06:02:07] PROBLEM - Disk space on labvirt1003 is CRITICAL: Connection refused by host [06:02:17] PROBLEM - dhclient process on labvirt1003 is CRITICAL: Connection refused by host [06:03:23] <_joe_> !log killed nrpe on labvirt1003 - see T99341 [06:03:29] Logged the message, Master [06:05:13] 6operations, 10ops-eqiad, 6Labs: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290209 (10Joe) @andrew why leaving this up would have "minimized the labs outage" is not clear to me. You've basically left a completely broken system (and an UBN!) ticket open to be consumed over the weekend... [06:06:12] 6operations, 10ops-eqiad, 6Labs: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290210 (10Joe) The only reason why I'm not rebooting this machine is that Andrew implied it would mean having downtime for labs, but I don't really see an alternative to an hard powercycle for now. [06:09:48] moring _joe_ [06:10:45] *morning [06:13:14] 6operations, 10ops-eqiad, 6Labs: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290226 (10mark) According to the ILO sensors both the fans and the temp sensors indicate OK/good health, so I doubt it's actually a matter of overheating. [06:22:33] so does anyone have shell access to labvirt1003? 
[06:23:43] serial console doesn't do much for me [06:24:26] mark: salt 'labvirt1003*' cmd.run 'some command' works [06:24:28] on palladium [06:25:21] <_joe_> mark: yeah the serial console is my fault, I tried to run ip link show with strace, and it got in D state immediately [06:25:32] pretty fucked [06:25:35] <_joe_> I am grepping the logs via salt [06:25:36] but I don't think it's really overheating [06:25:38] ok [06:25:40] <_joe_> it's not [06:26:04] <_joe_> we have mce signalling errors, but I was trying to figure out why mcelog --client gives nothing back [06:26:20] we've seen such things on dells with some bios settings [06:26:25] but not sure how that relates to these HPs [06:26:25] <_joe_> then we have a ton of processes stuck trying to reach the network [06:26:41] <_joe_> in close_wait state, to be precise [06:27:05] the cpu alarms show up on a number of hosts and don't appear to be at all related [06:27:15] <_joe_> and lockups in the kernel, related to sending network traffic [06:27:30] <_joe_> ori: the cpu alarms are a well known false positive [06:27:43] <_joe_> everyone at the WMF has alarmed for those at least once :) [06:27:50] yeah [06:28:32] <_joe_> so, my best bet would be a hard reboot, but since the instances on it are running fine, I'd wait for it to be a problem for users in fact, or monday [06:28:36] <_joe_> whatever comes first [06:29:10] <_joe_> I'm 99% sure it's a software issue, but still, mce logs (that I'm trying to understand how to read) [06:29:46] i'm still curious about why multiple labvirt* hosts starting showing an "error while receiving frame on vnetXX: Network is down" within a few minutes of each other at around 18:00 [06:29:56] PROBLEM - puppet last run on mw2092 is CRITICAL puppet fail [06:29:57] texting coren [06:30:13] <_joe_> ori: maybe related to some changes andrew was attempting [06:30:22] <_joe_> right, coren is in the right TZ now :) [06:30:44] <_joe_> mark: I'm getting off for a few minutes, my daughter wants breakfast :) [06:30:48] ok [06:31:57] PROBLEM - puppet last run on mw2016 is CRITICAL Puppet has 1 failures [06:32:27] PROBLEM - puppet last run on wtp2015 is CRITICAL Puppet has 1 failures [06:32:28] PROBLEM - puppet last run on cp3042 is CRITICAL Puppet has 1 failures [06:33:16] PROBLEM - puppet last run on mw1042 is CRITICAL Puppet has 2 failures [06:33:17] PROBLEM - puppet last run on mw2143 is CRITICAL Puppet has 1 failures [06:34:17] PROBLEM - puppet last run on mw1166 is CRITICAL Puppet has 2 failures [06:45:08] RECOVERY - puppet last run on mw2016 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:45:17] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [06:45:47] RECOVERY - puppet last run on cp3042 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:48:36] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [07:02:41] 6operations, 10ops-eqiad, 6Labs: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290273 (10ArielGlenn) The labs instances on the box seem to be working fine fwiw. 
[07:05:46] RECOVERY - puppet last run on mw1166 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [07:07:57] RECOVERY - puppet last run on mw2092 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [07:07:57] RECOVERY - puppet last run on mw1042 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:08:07] RECOVERY - puppet last run on mw2143 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:08:48] RECOVERY - puppet last run on wtp2015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:23:27] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [07:25:06] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [07:39:46] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [07:40:09] PROBLEM - High load average on labstore1001 is CRITICAL 77.78% of data above the critical threshold [24.0] [07:46:47] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [07:49:01] !log restart hhvm on mw1234, still pushing xhprof metrics [07:49:07] Logged the message, Master [08:00:36] akosiaris, hi, do you know if i have been granted graphoid cluster access? I see strange patterns of restarts on the graphoid [08:00:41] http://grafana.wikimedia.org/#/dashboard/db/graphoid [08:00:59] gwicke, ^ [08:21:01] 6operations, 10ops-esams: Implement CWDM between knams and esams - https://phabricator.wikimedia.org/T98971#1290280 (10mark) 5Open>3Resolved a:3mark Both CWDM systems are now in use, with one channel on each fiber. We can add channels (e.g. management from now on), but that doesn't need to block this tic... [08:25:27] PROBLEM - High load average on labstore1001 is CRITICAL 87.50% of data above the critical threshold [24.0] [08:59:57] RECOVERY - DPKG on labvirt1003 is OK: All packages OK [08:59:57] RECOVERY - nova-compute process on labvirt1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [09:00:27] RECOVERY - RAID on labvirt1003 is OK no RAID installed [09:00:37] RECOVERY - salt-minion processes on labvirt1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:00:56] RECOVERY - configured eth on labvirt1003 is OK - interfaces up [09:00:57] RECOVERY - SSH on labvirt1003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [09:01:07] RECOVERY - Disk space on labvirt1003 is OK: DISK OK [09:01:27] RECOVERY - dhclient process on labvirt1003 is OK: PROCS OK: 0 processes with command name dhclient [09:01:48] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0] [09:10:00] !log bounce hhvm on mw1141 [09:10:07] Logged the message, Master [09:12:07] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [09:15:43] !log bounce hhvm on mw1196 [09:15:50] Logged the message, Master [09:16:15] godog, around? could you send me the logs from graphoid please? [09:16:22] its failing for some reason [09:17:05] yurik: sure, which hosts in particular? [09:17:56] godog, they are running on sca1... 
and sca2..., no idea what host is actually causing the restarts [09:18:09] i am guessing that sca1 is the only one active atm [09:18:18] in eqiad [09:19:36] godog, if you want, put them on stat1002 into my homedir [09:22:00] yurik: should be in your home [09:22:06] thanks! [09:22:17] yw [09:30:07] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 745 [09:35:07] RECOVERY - check_mysql on db1008 is OK: Uptime: 2580575 Threads: 1 Questions: 7918388 Slow queries: 17140 Opens: 42915 Flush tables: 2 Open tables: 64 Queries per second avg: 3.068 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:01:48] (03PS1) 10KartikMistry: Beta: Enable all languages in source [puppet] - 10https://gerrit.wikimedia.org/r/211371 (https://phabricator.wikimedia.org/T98946) [10:06:07] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 62.50% of data above the critical threshold [35.0] [10:38:26] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [11:21:46] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0] [11:37:08] (03PS1) 10Faidon Liambotis: Revert "depool ulsfo due to traffic issues" [dns] - 10https://gerrit.wikimedia.org/r/211389 [11:37:15] (03PS2) 10Faidon Liambotis: Revert "depool ulsfo due to traffic issues" [dns] - 10https://gerrit.wikimedia.org/r/211389 [11:37:29] (03CR) 10Faidon Liambotis: [C: 032] Revert "depool ulsfo due to traffic issues" [dns] - 10https://gerrit.wikimedia.org/r/211389 (owner: 10Faidon Liambotis) [12:10:57] PROBLEM - puppet last run on mw2023 is CRITICAL Puppet has 1 failures [12:17:37] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [12:27:17] RECOVERY - puppet last run on mw2023 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [12:46:16] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [12:57:56] (03PS1) 10KartikMistry: Typo: Fix typo in cxserver module [puppet] - 10https://gerrit.wikimedia.org/r/211392 [12:59:33] (03PS1) 10Chmarkine: transparency - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/211394 (https://phabricator.wikimedia.org/T40516) [13:05:37] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [13:18:27] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [13:25:12] !log es-tool restart-fast on elastic1031 [13:25:23] Logged the message, Master [13:26:58] !log that was the last server in the elasticsearch rolling restart. all done. now we have new versions of the plugins. Lets try not to do that again. [13:27:05] Logged the message, Master [13:30:43] 6operations, 10ops-eqiad, 6Labs: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290542 (10Andrew) 5Open>3Resolved Detailed report is here: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150515-LabsOutage [14:21:06] 6operations, 10Wikimedia-DNS, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216#1290616 (1080686) good point about the privacy policy. My suggestion is that we point at the WMF privacy policy, as this is a service set up... 
[14:30:37] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0] [14:52:36] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [15:00:39] (03PS2) 10KartikMistry: Beta: Enable all languages in source and target [puppet] - 10https://gerrit.wikimedia.org/r/211371 (https://phabricator.wikimedia.org/T98946) [15:01:45] (03PS3) 10KartikMistry: Beta: CX: Enable all languages in source and target [puppet] - 10https://gerrit.wikimedia.org/r/211371 (https://phabricator.wikimedia.org/T98946) [15:09:00] (03CR) 10Santhosh: [C: 031] Beta: CX: Enable all languages in source and target [puppet] - 10https://gerrit.wikimedia.org/r/211371 (https://phabricator.wikimedia.org/T98946) (owner: 10KartikMistry) [15:16:57] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [15:20:17] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [15:25:43] (03CR) 10Nikerabbit: [C: 031] CX: Enable 'cxstats' campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211116 (owner: 10KartikMistry) [15:26:31] (03PS1) 10Dereckson: Fixed whitespace issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211407 [15:26:33] (03PS1) 10Dereckson: Site name configuration on ast.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211408 (https://phabricator.wikimedia.org/T99315) [15:33:05] (03CR) 10JanZerebecki: [C: 031] transparency - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/211394 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [15:36:28] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [15:41:27] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [15:50:08] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [16:05:38] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [16:21:57] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [16:30:20] (03CR) 10Bartosz DziewoƄski: [C: 04-1] "This is a step backwards. See my comment on the task for more." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175406 (https://phabricator.wikimedia.org/T73477) (owner: 10Glaisher) [16:50:08] PROBLEM - puppet last run on mw1205 is CRITICAL Puppet has 1 failures [17:06:17] RECOVERY - puppet last run on mw1205 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:08:37] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0] [17:37:12] I have fixed the issue that caused constant graphoid restarts. Will do deployment now, should be low impact. CC: greg-g godog [17:38:20] https://gerrit.wikimedia.org/r/#/c/211431/ [17:39:18] now? you know it's a saturday evening / morning, if something breaks who will be around to fix it? 
[17:39:23] yurik: [17:39:39] apergos, i will be here for as long as it takes to fix it [17:40:07] please do babysit it after the deploy then for awhile [17:40:17] i mean, i could wait for monday, but its crashing graphoid nonstop [17:40:34] apergos, http://grafana.wikimedia.org/#/dashboard/db/graphoid [17:40:37] take a look at the top line [17:40:55] the red indicates crashes [17:41:45] apergos, come to think of it, you are right, i shouldn't mess with it for such low numbers. [17:41:52] on weekend [17:41:54] will wait [17:42:02] whew [17:42:06] :D [17:42:11] just because... nothing ever takes 5 minutes :-D [17:42:19] see you back here on monday then [17:42:30] apergos, its not about time - i can totally babysit it until it gets working again [17:42:36] I know but. [17:55:26] (03PS1) 10ArielGlenn: nova monitoring instaces and salt keys: add new options [puppet] - 10https://gerrit.wikimedia.org/r/211432 [17:56:15] (03CR) 10jenkins-bot: [V: 04-1] nova monitoring instaces and salt keys: add new options [puppet] - 10https://gerrit.wikimedia.org/r/211432 (owner: 10ArielGlenn) [17:56:37] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0] [17:58:45] (03PS2) 10ArielGlenn: nova monitoring instaces and salt keys: add new options [puppet] - 10https://gerrit.wikimedia.org/r/211432 [17:59:38] yurik: you around? [17:59:44] gwicke, yep [17:59:48] hi [17:59:50] hey [18:00:04] the restart issues are not surprising [18:00:41] yeah - but its not good - because if the same service is handling something else at the same time, it will drop processing it [18:00:48] the deploy repo is ready to be synced [18:00:58] but apergos convinced me to wait until monday :) [18:01:12] I did! [18:01:43] I'm taking today to get a few patches in too, but no merges. [18:01:47] what do you mean with 'if the service is handling something else at the same time'? [18:01:54] you mean the restart is not graceful? [18:02:10] gwicke, on the grafena graph, what's the stepping of the graph? e.g. if it says 2 requests, whats the time unit? [18:02:22] re restarts - it crashes with an exception [18:02:29] yurik: rates are per second [18:02:30] the exception is unhandled [18:02:38] hmm, fairly hi [18:02:40] high [18:03:05] gwicke, https://phabricator.wikimedia.org/T99349 [18:03:25] for some things 2/s can maybe be considered high, but I'm not sure if this is one of them ;) [18:03:44] hehe, current is 4 [18:03:45] http://grafana.wikimedia.org/#/dashboard/db/graphoid [18:03:59] 4 restarts per second is a bit high [18:04:30] gwicke, what do you think, should i sync up the graphoid deploy repo today? [18:05:00] the good thing - its fairly isolated from the rest of the platform [18:05:16] if this is so common, how come it didn't show up in tests? [18:05:32] gwicke, because i suspect its one graph thats causing it [18:05:44] can you verify in logstash? [18:05:55] logstash is not showing anything related to graphoid [18:06:12] the issue is that the external data that some graph is ussing is not properly formed [18:06:15] thus causing an exception [18:07:05] is logging not set up properly? [18:07:25] gwicke, i don't know - godog has sent me the logs from the sca1001 machine [18:07:36] that's how i figured it out [18:08:10] we have generally progressed past local logging [18:08:22] would be good to fix that [18:09:26] gwicke, https://phabricator.wikimedia.org/T97615 [18:09:37] 2 weeks ago :) [18:09:39] yurik: what is the rate of restarts relative to the total # of requests? 
[18:10:06] gwicke, per that graph above (the blue line) - considerable [18:10:10] 1:1 [18:10:14] or 1:2 [18:10:55] hmm [18:11:13] so you are saying it's basically completely broken [18:12:00] so maybe i should deploy. Could you take a look at the patch https://gerrit.wikimedia.org/r/#/c/211431/ [18:12:27] hmm, there doesn't seem to be any logging config in config.yaml [18:12:28] gwicke, i think its not - because the errors would not show as successes [18:12:33] did you just forget to set that up? [18:12:44] gwicke, config is generated on the fyl [18:12:44] fly [18:12:48] is this a critical service/is it an emergency? that's what I would ask before a weekend deployment [18:13:10] critical - well, not much in WP is really critical, its an info service :D [18:13:22] apergos: it's not going to break the site [18:14:26] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [18:14:36] gwicke, so i am not exactly sure what config puppets generate on the actual service - will wait for the sca1001 access to look [18:14:45] its a template from puppet [18:14:59] yurik: https://github.com/wikimedia/operations-puppet/blob/production/modules/graphoid/templates/config.yaml.erb is missing a logging stanza [18:15:20] example: https://github.com/wikimedia/operations-puppet/blob/production/modules/restbase/templates/config.yaml.erb#L11-L18 [18:15:47] gwicke, i would speak with mobrovac first- maybe he did it for a reason? [18:16:21] him and akosiaris worked on setting it up, would want to ping them first [18:17:42] gwicke, btw, strange - if the logging is not setup, how could there be a log file on sca1001 [18:19:01] default is to log to stdout [18:19:16] stdout is captured? [18:19:35] and afaik alex wanted to redirect that to a local log file, despite the risk of taking the service down on full disk [18:20:14] a lesson we learned with Parsoid [18:20:49] exactly why i don't want to touch it without consulting them first [18:21:03] gwicke, anyway, you thoughts, should i deploy now? [18:21:33] yurik: my main concern with deploying now is that the testing for graphoid doesn't seem to be very comprehensive [18:22:12] on the surface it looks like it's very broken already, so any change is likely to improve things, but it worries me [18:22:14] gwicke, that is true - but testing in this case is fairly complex - its not like we have to test if the service is working, we have to test the entire Vega grammar [18:22:47] e.g. what if vega does some extra processing async, and throws up? [18:23:15] with async processing, it is not possible to just wrap the call in try/catch and handle it [18:23:54] unless js has some global unhandled async handler, which i could simply log and continue... not the best solution [18:24:06] do you have a link to your changes? [18:24:19] you can catch exceptions globally [18:24:24] ^^^^ [18:24:27] sec, will repost [18:24:37] and promises let you propagate exceptions asynchronously [18:24:39] https://gerrit.wikimedia.org/r/#/c/211431/ [18:25:01] gwicke, that's only if the lib uses promises :) [18:25:08] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.72% of data above the critical threshold [1000.0] [18:25:08] which in this case it doesn't :( [18:26:04] i will try to speak with their devs, see if i can convince them to start working in the open, instead of simply publishing their internal work once in a while [18:26:15] did you actually verify that it helps, and that it doesn't break other graphs? 
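A minimal sketch of the process-level hooks being discussed just above ("you can catch exceptions globally", "promises let you propagate exceptions asynchronously"). This is not graphoid's actual code: the toy HTTP server, the port, and the console logging are illustrative assumptions. The point is that an exception thrown inside an async callback (for example during vega rendering) bypasses any try/catch wrapped around the request handler, so the only remaining place to log it before the worker dies is a process-level handler; promise-based code instead propagates such errors to a .catch() or the unhandledRejection event.

    'use strict';
    var http = require('http');

    // Synchronous throws inside async callbacks (timers, I/O, rendering
    // callbacks) cannot be caught by a try/catch around the request handler;
    // they surface here instead. Log, then exit: state may be corrupt, so let
    // the supervisor (upstart/systemd/service runner) restart the worker.
    process.on('uncaughtException', function (err) {
        console.error('uncaught exception, worker exiting:', err.stack || err);
        process.exit(1);
    });

    // Errors that propagate through promises but are never .catch()ed end up
    // here (supported on io.js / newer Node; older 0.10-era Node lacks it).
    process.on('unhandledRejection', function (reason) {
        console.error('unhandled promise rejection:', reason);
    });

    // Toy server standing in for the service: one process handles many
    // concurrent requests, which is why a single malformed graph definition
    // can crash requests that have nothing to do with it.
    http.createServer(function (req, res) {
        setTimeout(function () {
            if (req.url.indexOf('bad') !== -1) {
                // async throw: only the uncaughtException hook above sees this
                throw new Error('malformed external data for this graph');
            }
            res.end('ok\n');
        }, 10);
    }).listen(8080);

Exiting on uncaughtException and letting the supervisor restart the process is the conventional choice; swallowing the error and continuing risks serving requests from corrupted state, which is part of why the specific failing graph still needs to be identified and fixed.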
[18:26:32] partially - i don't know which graph is causing this issue [18:26:44] i suspect i understand what causes it, but don't know for suer [18:26:45] sure [18:26:52] i am ok with waiting until monday [18:28:28] gwicke, question for you: does a single service instance handle multiple requests? e.g. it handles one at a time, but when it does something async, it handles another request in the mean time [18:28:32] I can have a look at the varnish log to see if I can identify the graph [18:28:39] sounds good [18:29:11] thanks [18:29:18] could i do it too/ [18:30:08] not sure if you have shell on cp1045 and cp1058 [18:30:40] nope ( [18:33:44] the paths will all have 'png' in them, right? [18:34:47] gwicke, yep [18:35:13] hmm, i just realized i could have checked in hive [18:35:47] varnishncsca doesn't seem to show any requests for pngs [18:38:43] gwicke, i will check via hive [18:38:54] (03PS1) 10ArielGlenn: deployment server init should configure repo every time [puppet] - 10https://gerrit.wikimedia.org/r/211435 [18:39:29] yurik: my recommendation is to find the failing graph, create a test for it & verify that it's fixed [18:39:36] (03CR) 10jenkins-bot: [V: 04-1] deployment server init should configure repo every time [puppet] - 10https://gerrit.wikimedia.org/r/211435 (owner: 10ArielGlenn) [18:39:49] gwicke, will try ) [18:39:49] to find the graph, you might have to improve logging [18:40:56] apparently hue is no longer letting me in, i will try to do it via direct querying [18:42:03] (03PS2) 10ArielGlenn: deployment server init should configure repo every time [puppet] - 10https://gerrit.wikimedia.org/r/211435 [18:51:42] gwicke, is it possible that varnish is not logging any %graphoid% access to hadoop? [18:52:12] this returns no results: select * FROM wmf.webrequest WHERE year=2015 AND month=5 AND day=16 AND hour=16 AND uri_host like '%graphoid%' limit 100; [18:54:33] it looks likely [18:54:57] great [18:56:58] the good news is that RB logs backend errors [18:58:40] RB? [18:58:44] restbase [19:20:08] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [19:24:10] (03CR) 10ArielGlenn: "I'd like to see that second approach implemented; updating the source repo from itself is just wrong." 
[puppet] - 10https://gerrit.wikimedia.org/r/201344 (https://phabricator.wikimedia.org/T94754) (owner: 10BryanDavis) [19:46:25] (03PS1) 10Yurik: Removed localhost access by graphoid [puppet] - 10https://gerrit.wikimedia.org/r/211450 [19:48:27] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0] [20:06:26] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [20:09:37] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [20:36:31] yurik: https://github.com/wikimedia/restbase/pull/247 [21:09:16] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [22:08:56] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0] [22:20:16] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [22:33:08] PROBLEM - puppet last run on ms-be2001 is CRITICAL puppet fail [22:35:27] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0] [22:50:56] RECOVERY - puppet last run on ms-be2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:58:46] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0] [23:06:27] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0] [23:11:16] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [23:30:17] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [23:31:38] kart_: where and how can i use contenttranslation?