[00:08:58] 6operations, 10Continuous-Integration, 6Labs, 10OOjs, 10Wikimedia-Labs-Infrastructure: Jenkins failing with "Error: GET https://saucelabs.com: Couldn't resolve host name." - https://phabricator.wikimedia.org/T92351#1118444 (10Krinkle) ``` [16:03 CET] krinkle at KrinkleMac in ~ $ host saucelabs.com saucel... [00:21:45] 6operations, 10Wikimedia-Labs-Other, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1118454 (10coren) [00:22:49] 6operations, 10Continuous-Integration, 6Labs, 10OOjs, 10Wikimedia-Labs-Infrastructure: Jenkins failing with "Error: GET https://saucelabs.com: Couldn't resolve host name." - https://phabricator.wikimedia.org/T92351#1118458 (10Dzahn) I found the root cause to be this option in /etc/resolv.conf ``` option... [00:26:28] (03PS1) 10Dzahn: don't use 'ndots: 2' in labs resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/196731 (https://phabricator.wikimedia.org/T92351) [00:27:07] Krinkle: ^ that's it [00:27:11] be back later though [00:27:21] mutante: Nice. [00:27:27] work-around for now: you could also just use [00:27:31] www.saucelabs.com [00:27:41] it will work without fix and same IP [00:27:54] mutante: No I can't. That would require making changes through 10 layers of package dependencies. [00:27:58] saucelabs.com by itself is not really a host [00:28:08] ah, i understand, ok [00:28:10] It's code provided by local builds, not something we maintain and deploy explicitly. [00:28:16] then, hack /etc/resolv.conf and remove that line [00:28:18] It's fetched from packagist/composer/npm at run time. [00:28:21] :) [00:28:31] I'll cherry-pick your patch to our pm? [00:28:59] 6operations, 10Continuous-Integration, 6Labs, 10OOjs, and 2 others: Jenkins failing with "Error: GET https://saucelabs.com: Couldn't resolve host name." - https://phabricator.wikimedia.org/T92351#1118464 (10coren) ndots:2 is necessary for something else, the actual bug is that the dnsmasq server should emp... [00:29:26] Krinkle: yes, and then let Coren see it [00:29:49] See ^^ [00:29:54] hu.. it's not there anymore [00:29:57] I dind't cherry-pick anything yet [00:30:00] ndots:2 is gone [00:30:10] Effing [bleep] [bleep] dnsmasq [bleep] [00:30:27] 1 one 100 connections does go through [00:30:29] which is odd [00:30:32] Krinkle: That's odd. [00:30:36] Coren: :) ah! [00:30:36] maybe it's oscillating? [00:30:47] two puppet thingies taking turns? [00:31:00] No, it's just dnsmasq being its usual piece of shit. [00:31:15] ndots:2 is gone and `host saucelabs.com` now resolves. [00:31:52] Krinkle: Yeah, but ndots:2 is necessary for properly managing the database names, and should *not* cause this. [00:32:03] Coren: I understand. [00:32:07] Coren: I didn't change it though. 
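The ndots behaviour being debated above can be illustrated with stock resolver tools. This is only a sketch of the mechanism, run by hand on an affected Labs instance, not a fix: with "options ndots:2" in /etc/resolv.conf, a name containing fewer than two dots (such as saucelabs.com) is tried against the search domains before it is queried as an absolute name, which is why mutante's www.saucelabs.com workaround resolves while the bare name fails.
```
# Show the search/options lines that drive the behaviour:
grep -E '^(search|options)' /etc/resolv.conf

# One dot < ndots:2, so the search list is consulted first and a bad answer
# from the local resolver can mask the real one:
host saucelabs.com

# Two dots >= ndots:2, so the name is queried as-is first (the suggested workaround):
host www.saucelabs.com
```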
[00:32:16] 6 minuts ago something changed /etc/resolve.conf [00:33:26] on that one instance i did, but puppet will change it back [00:40:59] (03CR) 10Tim Landscheidt: [C: 04-1] "Without ndots:2, the search parameter is no longer useful (the purpose of change I290c32fdeedc9319b804f886c21fe1c49a6c864f was to allow en" [puppet] - 10https://gerrit.wikimedia.org/r/196731 (https://phabricator.wikimedia.org/T92351) (owner: 10Dzahn) [00:44:35] 6operations, 10Wikimedia-Labs-Other, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1118490 (10Krenair) [00:44:47] 6operations, 10Wikimedia-Labs-Other, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#530760 (10Krenair) [00:49:38] 6operations, 10Continuous-Integration, 6Labs, 10OOjs, and 2 others: Jenkins failing with "Error: GET https://saucelabs.com: Couldn't resolve host name." - https://phabricator.wikimedia.org/T92351#1118500 (10scfc) The error doesn't seem to lie with dnsmasq. On `tools-login`, the look-up succeeds, on `tools... [00:50:23] !log legoktm Synchronized php-1.25wmf21/extensions/MassMessage/includes/job/MassMessageServerSideJob.php: https://gerrit.wikimedia.org/r/#/c/196729/ (duration: 00m 06s) [00:50:31] Logged the message, Master [00:51:01] 6operations, 10Wikimedia-Labs-Other, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1118515 (10coren) [00:51:03] mutante: ah, okay, that explains. [00:51:12] !log legoktm Synchronized php-1.25wmf20/extensions/MassMessage/includes/job/MassMessageServerSideJob.php: https://gerrit.wikimedia.org/r/#/c/196729/ (duration: 00m 09s) [00:51:16] Logged the message, Master [00:53:20] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:56] 6operations, 10Continuous-Integration, 6Labs, 10OOjs, and 2 others: Jenkins failing with "Error: GET https://saucelabs.com: Couldn't resolve host name." - https://phabricator.wikimedia.org/T92351#1118537 (10coren) No, it's just that the Precise libresolv seems to be a little more forgiving and skips over t... [00:56:31] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60645 bytes in 0.048 second response time [01:03:15] 6operations, 10Continuous-Integration, 6Labs, 10OOjs, and 2 others: Jenkins failing with "Error: GET https://saucelabs.com: Couldn't resolve host name." - https://phabricator.wikimedia.org/T92351#1118547 (10coren) To wit: ``` marc@tools-trusty:~$ host notexist Host notexist.eqiad.wmflabs not found: 2(SERVF... [01:03:43] 6operations, 10Continuous-Integration, 6Labs, 10OOjs, and 2 others: Jenkins failing with "Error: GET https://saucelabs.com: Couldn't resolve host name." - https://phabricator.wikimedia.org/T92351#1118555 (10Krinkle) I suspect this error got introduced when I switched over the CI pool from the Trusty instan... [01:10:00] 6operations, 10Continuous-Integration, 6Labs, 10OOjs, and 2 others: Jenkins failing with "Error: GET https://saucelabs.com: Couldn't resolve host name." - https://phabricator.wikimedia.org/T92351#1118565 (10Krinkle) [01:10:02] 7Puppet, 6operations, 10Continuous-Integration: Puppet (silently) fails to setup apache on some integration-slave14xx instances - https://phabricator.wikimedia.org/T91832#1118566 (10Krinkle) [01:24:19] (03CR) 10Krinkle: "Cherry-picked to integration-puppetmaster." 
[puppet] - 10https://gerrit.wikimedia.org/r/196731 (https://phabricator.wikimedia.org/T92351) (owner: 10Dzahn) [01:36:26] 6operations, 10Continuous-Integration, 6Labs, 10OOjs, and 2 others: Jenkins failing with "Error: GET https://saucelabs.com: Couldn't resolve host name." - https://phabricator.wikimedia.org/T92351#1118579 (10scfc) @Coren: But there are you querying the Labs server, and (I think) dnsmasq just passes the requ... [01:39:24] 6operations, 10Continuous-Integration, 6Labs, 10OOjs, and 2 others: Jenkins failing with "Error: GET https://saucelabs.com: Couldn't resolve host name." - https://phabricator.wikimedia.org/T92351#1118580 (10scfc) And: ``` scfc@tools-login:~$ dig @10.68.16.1 tools-login.eqiad.wmflabs ; <<>> DiG 9.8.1-P1 <... [01:41:03] 6operations, 10Continuous-Integration, 6Labs, 10OOjs, and 2 others: Jenkins failing with "Error: GET https://saucelabs.com: Couldn't resolve host name." - https://phabricator.wikimedia.org/T92351#1118589 (10scfc) (Or a host name that does not exist.) [01:41:38] 6operations, 10Continuous-Integration, 6Labs, 10OOjs, and 2 others: Jenkins failing with "Error: GET https://saucelabs.com: Couldn't resolve host name." - https://phabricator.wikimedia.org/T92351#1118590 (10coren) >>! In T92351#1118579, @scfc wrote: > So (from a distance) it appears as if the WMF nameserve... [01:51:05] 6operations, 10Math, 7Wikimedia-log-errors: Fatal in production due to outdated production: Missing "texvccheck" executable - https://phabricator.wikimedia.org/T92707#1118592 (10Krenair) 3NEW [01:53:21] PROBLEM - puppet last run on mw2008 is CRITICAL: CRITICAL: puppet fail [01:56:33] 6operations, 10Math, 7Wikimedia-log-errors: Fatal in production due to outdated package: Missing "texvccheck" executable - https://phabricator.wikimedia.org/T92707#1118600 (10Krenair) [02:00:17] 6operations, 10Continuous-Integration, 6Labs, 10OOjs, and 2 others: Jenkins failing with "Error: GET https://saucelabs.com: Couldn't resolve host name." - https://phabricator.wikimedia.org/T92351#1118610 (10scfc) http://www.linuxquestions.org/questions/linux-networking-3/powerdns-servfail-945615/ (NB: MySQ... [02:00:46] (03CR) 10Krinkle: [C: 031] "Cherry-picked to integration-puppetmaster since it was causing puppet errors." [puppet] - 10https://gerrit.wikimedia.org/r/196174 (https://phabricator.wikimedia.org/T92482) (owner: 10Hashar) [02:07:05] 6operations, 10Wikimedia-Labs-wikitech-interface: wikitech.wikimedia.org SSL certificate considered "outdated security" in Chrome - https://phabricator.wikimedia.org/T92709#1118619 (10Krinkle) 3NEW [02:08:34] 6operations, 10Wikimedia-Labs-wikitech-interface, 7HTTPS: wikitech.wikimedia.org SSL certificate considered "outdated security" in Chrome - https://phabricator.wikimedia.org/T92709#1118626 (10Krenair) [02:09:51] Coren: mutante: Cherry-picking that patch and running puppet does not make resolv.conf change [02:09:53] It's still there [02:13:10] RECOVERY - puppet last run on mw2008 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [02:14:01] krenair@silver:~$ mwscript eval.php --wiki=labswiki [02:14:01] mkdir: cannot create directory '/sys/fs/cgroup/memory/mediawiki/job/31885': No such file or directory [02:14:01] limit.sh: failed to create the cgroup.
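A couple of checks that might narrow down the limit.sh failure pasted above, run on silver where the mwscript call was made. The paths come straight from the error message; everything else assumes the usual cgroup v1 layout and is only a sketch.
```
# Is a memory cgroup hierarchy mounted at all?
mount -t cgroup | grep memory

# Does the parent of the directory that mkdir failed to create exist?
ls -ld /sys/fs/cgroup/memory/mediawiki /sys/fs/cgroup/memory/mediawiki/job
```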
[02:14:02] sigh [02:14:29] 6operations, 10Continuous-Integration, 6Labs: Evaluate options to make puppet errors more visible - https://phabricator.wikimedia.org/T92710#1118636 (10Krinkle) 3NEW [02:19:45] 6operations, 10Continuous-Integration, 6Labs: Evaluate options to make puppet errors more visible - https://phabricator.wikimedia.org/T92710#1118645 (10scfc) What do you mean by "puppet failures and random regressions" in this case? [02:21:33] (03CR) 10Alex Monk: "This is causing fluorine.eqiad.wmnet:/a/mw-log/eventlogging.log to be full of these errors:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158405 (owner: 10Andrew Bogott) [02:24:39] !log l10nupdate Synchronized php-1.25wmf20/cache/l10n: (no message) (duration: 06m 37s) [02:24:52] Logged the message, Master [02:27:27] 6operations: mwscript showing errors on silver/labswiki/wikitech - https://phabricator.wikimedia.org/T92712#1118661 (10Krenair) 3NEW a:3Andrew [02:29:14] !log LocalisationUpdate completed (1.25wmf20) at 2015-03-14 02:28:10+00:00 [02:29:22] Logged the message, Master [02:31:31] PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [500.0] [02:31:31] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [500.0] [02:38:39] 6operations, 10Continuous-Integration, 6Labs: Evaluate options to make puppet errors more visible - https://phabricator.wikimedia.org/T92710#1118670 (10Krinkle) >>! In T92710#1118645, @scfc wrote: > What do you mean by "puppet failures and random regressions" in this case? Integrity errors or quality issues... [02:44:11] RECOVERY - HTTP 5xx req/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [02:44:23] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [02:44:27] (03CR) 10Krinkle: [C: 04-1] "Did not work. ndots:2 is stilll there on the instances after cherry-picking and forcing a puppet run. saucelabs still inaccessible." [puppet] - 10https://gerrit.wikimedia.org/r/196731 (https://phabricator.wikimedia.org/T92351) (owner: 10Dzahn) [02:49:40] !log l10nupdate Synchronized php-1.25wmf21/cache/l10n: (no message) (duration: 06m 28s) [02:49:49] Logged the message, Master [02:50:07] 6operations, 10Math, 7Wikimedia-log-errors: Fatal in production due to outdated package: Missing "texvccheck" executable - https://phabricator.wikimedia.org/T92707#1118687 (10Physikerwelt) see also T91191 [02:54:22] !log LocalisationUpdate completed (1.25wmf21) at 2015-03-14 02:53:19+00:00 [02:54:30] Logged the message, Master [03:59:39] 6operations, 10Wikimedia-Labs-wikitech-interface, 7HTTPS: wikitech.wikimedia.org SSL certificate considered "outdated security" in Chrome - https://phabricator.wikimedia.org/T92709#1118701 (10Dzahn) [04:21:14] 6operations, 10Continuous-Integration, 6Labs, 10OOjs, and 2 others: Jenkins failing with "Error: GET https://saucelabs.com: Couldn't resolve host name." - https://phabricator.wikimedia.org/T92351#1118720 (10Dzahn) about workarounds: /etc/nsswitch says: hosts: files dns so to first check files an... [04:26:15] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia: sign cortado applet so that it works for people with outdated java - https://phabricator.wikimedia.org/T62287#1118724 (10Dzahn) also see T83995 (duplicate?) 
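Dzahn's nsswitch.conf note at 04:21 above points at a host-level workaround: because "hosts: files dns" consults /etc/hosts before DNS, a static entry would sidestep the broken resolution entirely. The sketch below uses a documentation-range placeholder address, not the real one; it illustrates the idea and is not something that was applied.
```
getent hosts saucelabs.com        # what the nsswitch chain returns today

# 203.0.113.10 is a placeholder from the documentation range; substitute the
# address a healthy resolver returns for saucelabs.com before trying this:
echo '203.0.113.10 saucelabs.com' | sudo tee -a /etc/hosts

getent hosts saucelabs.com        # now answered by "files" before "dns" is asked
```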
[04:30:44] 6operations, 10Wikimedia-Labs-wikitech-interface, 7HTTPS: wikitech.wikimedia.org SSL certificate considered "outdated security" in Chrome - https://phabricator.wikimedia.org/T92709#1118731 (10Dzahn) this should be T73156 (SHA1 needs to be replaced with a SHA256 cert) Chrome 39 will warn users if SHA1 certif... [04:31:13] 6operations, 7HTTPS, 5Patch-For-Review: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#753141 (10Dzahn) [04:31:14] 6operations, 10Wikimedia-Labs-wikitech-interface, 7HTTPS: wikitech.wikimedia.org SSL certificate considered "outdated security" in Chrome - https://phabricator.wikimedia.org/T92709#1118734 (10Dzahn) [04:38:10] RECOVERY - Graphite Carbon on graphite2001 is OK: OK: All defined Carbon jobs are runnning. [04:40:27] 6operations: Digitally sign cortado video player java applet - https://phabricator.wikimedia.org/T83995#1118735 (10Krenair) [04:41:41] PROBLEM - Graphite Carbon on graphite2001 is CRITICAL: CRITICAL: Not all configured Carbon instances are running. [06:28:50] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 2 failures [06:28:51] PROBLEM - puppet last run on amssq46 is CRITICAL: CRITICAL: puppet fail [06:29:11] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:01] PROBLEM - puppet last run on analytics1010 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:51] PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:50] PROBLEM - puppet last run on amssq47 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:11] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 3 failures [06:32:31] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 2 failures [06:45:31] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:46:01] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:46:10] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:46:10] RECOVERY - puppet last run on analytics1010 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:46:21] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:46:51] RECOVERY - puppet last run on amssq47 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:47:01] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:47:20] RECOVERY - puppet last run on amssq46 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [07:16:37] (03PS7) 10Yuvipanda: [WIP] ldap+yaml file puppet ENC for self hosted puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/196628 [07:21:34] 7Puppet, 6operations, 10Beta-Cluster: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#1118800 (10yuvipanda) [07:21:35] 7Puppet, 6operations, 10Beta-Cluster, 5Patch-For-Review: Use keyholder for deploy key management - https://phabricator.wikimedia.org/T92367#1118798 (10yuvipanda) 5Open>3Resolved Done now. Anyone who is a member of the deployment-prep project can now run scap without having to sudo to anything. 
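The keyholder change yuvipanda marks resolved above (T92367) means the deploy key sits with an agent that deployment-prep members can use without sudo. A hypothetical spot check from a deployment-prep instance; the proxy socket path is an assumption about the keyholder setup, not something stated in the log.
```
# If keyholder is armed, the proxied agent lists the deploy key without sudo:
SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh-add -l
```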
[07:22:39] 7Puppet, 10Beta-Cluster, 5Patch-For-Review: Unify labs and prod roles for role::deployment::deployment_servers - https://phabricator.wikimedia.org/T86885#1118801 (10yuvipanda) 5Open>3Resolved a:3yuvipanda DDDONE. That was painful :) See I3e947637b49ce2a94128e21db35798a49e8d45e8 [07:22:40] 7Puppet, 10Beta-Cluster, 5Patch-For-Review, 7Tracking: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#1118804 (10yuvipanda) [07:25:03] 7Puppet, 6operations, 10Beta-Cluster, 10Staging: Move scap puppet code into a module - https://phabricator.wikimedia.org/T87221#1118807 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Done. beta/scap is gone. [07:25:04] 7Puppet, 6operations, 10Beta-Cluster: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#1118810 (10yuvipanda) [07:28:52] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Mar 14 07:27:49 UTC 2015 (duration 27m 48s) [07:29:03] Logged the message, Master [07:33:16] 500 pattern looks like redis again? https://gdash.wikimedia.org/dashboards/reqerror/ [07:46:29] (03PS1) 10Yuvipanda: mediawiki: Ensure that /etc/php5/apache dir exists [puppet] - 10https://gerrit.wikimedia.org/r/196773 [07:46:33] _joe_: ^ if you’re working :P [07:47:14] <_joe_> YuviPanda: -1 [07:47:40] I’m getting puppte fails on a newly provisioned deployment host, otherwise. [07:50:01] (03PS2) 10Yuvipanda: mediawiki: Ensure that /etc/php5/apache dir exists [puppet] - 10https://gerrit.wikimedia.org/r/196773 (https://phabricator.wikimedia.org/T88442) [07:50:35] <_joe_> YuviPanda: the deployment host should not need that dir probably? [07:50:43] <_joe_> but well on monday :) [07:50:46] yeh :D [07:50:55] * YuviPanda is planning on going to the beach shortly [07:50:58] <_joe_> the correct thing would be to make that dependent on libapache-mod-php5 [07:51:07] right. [07:51:13] <_joe_> but I have a few issues to address before that [07:51:19] the deployment host does have mediawiki included in it, though [08:37:37] 7Puppet, 6operations, 5Patch-For-Review: Convert host lists in dsh/files/groups to hiera - https://phabricator.wikimedia.org/T92259#1118830 (10yuvipanda) p:5Triage>3Normal [08:40:08] 6operations, 10Analytics, 6Scrum-of-Scrums, 10Wikipedia-App-Android-App, and 3 others: Avoid cache fragmenting URLs for Share a Fact shares - https://phabricator.wikimedia.org/T90606#1118832 (10yuvipanda) This needs a VCL patch as well. 
[08:42:48] (03PS1) 10Yuvipanda: labs: set resolf.conf ndots to 1 [puppet] - 10https://gerrit.wikimedia.org/r/196775 (https://phabricator.wikimedia.org/T92351) [08:45:15] (03PS2) 10Yuvipanda: labs: set resolf.conf ndots to 1 [puppet] - 10https://gerrit.wikimedia.org/r/196775 (https://phabricator.wikimedia.org/T92351) [08:46:33] (03PS3) 10Yuvipanda: labs: set resolf.conf ndots to 1 [puppet] - 10https://gerrit.wikimedia.org/r/196775 (https://phabricator.wikimedia.org/T92351) [08:50:21] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333 [08:55:31] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [08:59:01] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [08:59:01] PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [09:09:47] (03Abandoned) 10Yuvipanda: labs: set resolf.conf ndots to 1 [puppet] - 10https://gerrit.wikimedia.org/r/196775 (https://phabricator.wikimedia.org/T92351) (owner: 10Yuvipanda) [09:11:41] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:11:41] RECOVERY - HTTP 5xx req/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:16:31] (03CR) 10Yuvipanda: "Email to engineering@ sent." [puppet] - 10https://gerrit.wikimedia.org/r/195840 (https://phabricator.wikimedia.org/T92259) (owner: 10Dzahn) [11:16:01] RECOVERY - puppet last run on mw2001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [11:20:00] PROBLEM - puppet last run on mw1102 is CRITICAL: CRITICAL: Puppet has 1 failures [11:38:11] RECOVERY - puppet last run on mw1102 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [12:32:52] 6operations, 7HTTPS, 7Performance: Support SPDY - https://phabricator.wikimedia.org/T35890#1118906 (10faidon) 5Open>3Resolved a:3faidon @bblack tackled this while implementing T86648. All HTTP frontends are now running an newer stack and have SPDY enabled. There is a number of subsequent performance en... 
[12:37:01] PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [12:37:01] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [12:37:21] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [12:37:22] PROBLEM - HTTP error ratio anomaly detection on graphite2001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [12:48:52] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [12:49:00] PROBLEM - HTTP error ratio anomaly detection on graphite2001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [12:51:15] (03PS1) 10Glaisher: Add 'autopatrol' protection level to lvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196779 (https://phabricator.wikimedia.org/T92645) [13:02:21] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:02:21] RECOVERY - HTTP 5xx req/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:24:04] 6operations, 10Continuous-Integration, 10Incident-20150312-whitespace, 6MediaWiki-Core-Team: add a check for whitespace before leading 6operations, 10Continuous-Integration, 6Labs, 10OOjs, and 2 others: Jenkins failing with "Error: GET https://saucelabs.com: Couldn't resolve host name." - https://phabricator.wikimedia.org/T92351#1118965 (10scfc) Has someone looked at whether there is an SOA record in LDAP? If that is the source of the pr... [14:01:57] 6operations, 10Continuous-Integration, 6Labs: Evaluate options to make puppet errors more visible - https://phabricator.wikimedia.org/T92710#1118969 (10scfc) There are two aspects to this: # Whether the Git failure caused Puppet to fail. This seems to have been the case. # Whether the Puppet failure trigge... [14:10:30] RECOVERY - HTTP error ratio anomaly detection on graphite2001 is OK: OK: No anomaly detected [14:10:31] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [15:40:12] 6operations, 6CA-team, 6Commons, 6MediaWiki-Core-Team, 10SUL-Finalization: db1068 (s4/commonswiki slave) is missing data about at least 6 users - https://phabricator.wikimedia.org/T91920#1119071 (10Steinsplitter) [15:55:03] (03PS1) 10Glaisher: Remove $wgSpamRegex in favor of AbuseFilter and SpamBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196782 (https://phabricator.wikimedia.org/T50491) [15:56:00] (03CR) 10Glaisher: "Might be useful to see how many hits we get in the SpamRegex log." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196782 (https://phabricator.wikimedia.org/T50491) (owner: 10Glaisher) [16:16:46] 6operations, 10ops-esams: Rack, cable, prepare cp3030-3053 - https://phabricator.wikimedia.org/T92514#1119086 (10mark) Asset tags have been applied to all servers. 
From the bottom of rack OE10 going upwards (20 servers), they have asset tags WMF4405 - WMF4424 [16:17:21] 6operations, 10ops-esams: Rack, cable, prepare cp3030-3053 - https://phabricator.wikimedia.org/T92514#1119087 (10mark) [16:18:59] 6operations, 10ops-esams: Rack and configure asw-esams (new 2xQFX5100 stack) - https://phabricator.wikimedia.org/T91643#1119088 (10mark) We decided to put switch "asw-oe11-esams" in rack OE10 instead, so will become asw-oe10-esams. It has asset tag WMF4425. Serial port desc on the SCS still needs updating. [17:32:12] 6operations, 10ops-esams: Rack, cable, prepare cp3030-3049 - https://phabricator.wikimedia.org/T92514#1119136 (10BBlack) [17:33:18] 6operations, 10ops-esams: Rack, cable, prepare cp3030-3049 - https://phabricator.wikimedia.org/T92514#1113544 (10BBlack) [17:33:19] 6operations, 7HTTPS, 3HTTPS-by-default: Expand HTTP frontend clusters with new hardware - https://phabricator.wikimedia.org/T86663#1119139 (10BBlack) [17:37:57] (03PS1) 10BBlack: remove cp3050-3053 dns [dns] - 10https://gerrit.wikimedia.org/r/196783 [17:38:17] (03CR) 10BBlack: [C: 032] remove cp3050-3053 dns [dns] - 10https://gerrit.wikimedia.org/r/196783 (owner: 10BBlack) [18:00:30] mw2008 has hyper threading disabled [18:20:57] 6operations, 10ops-codfw: mw2008 has Hyper threading disabled - https://phabricator.wikimedia.org/T92738#1119157 (10hoo) 3NEW [18:24:44] (03PS2) 10Matanya: nova: lint compute.pp [puppet] - 10https://gerrit.wikimedia.org/r/195535 [18:37:17] (03CR) 10Matanya: [C: 031] "looks good." [puppet] - 10https://gerrit.wikimedia.org/r/196621 (owner: 10Andrew Bogott) [19:11:30] (03PS1) 10Dzahn: delete adminbot config template [puppet] - 10https://gerrit.wikimedia.org/r/196787 [19:21:16] (03CR) 10Alex Monk: "krenair@fluorine:/a/mw-log$ grep "2015-03-14 18:" spam.log -c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196782 (https://phabricator.wikimedia.org/T50491) (owner: 10Glaisher) [19:34:50] (03CR) 10Tim Landscheidt: [C: 031] "IIRC, adminbot is used by morebots ("!log" => SAL; or in other words: morebots is a tiny wrapper around adminbot). But the config for mor" [puppet] - 10https://gerrit.wikimedia.org/r/196787 (owner: 10Dzahn) [20:16:45] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Security: securing the RESTBase Cassandra cluster - https://phabricator.wikimedia.org/T92680#1119258 (10Krenair) [20:31:06] (03PS1) 10Faidon Liambotis: Add cp3030-cp3049 to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/196793 [21:05:27] (03CR) 10Faidon Liambotis: [C: 032] "Tested with cp3030 first :)" [puppet] - 10https://gerrit.wikimedia.org/r/196793 (owner: 10Faidon Liambotis) [21:40:51] 6operations, 10ops-esams: Rack, cable, prepare cp3030-3049 - https://phabricator.wikimedia.org/T92514#1119296 (10Multichill) {F91868} servers are in the rack and hooked up to a management and production switch. * Bottom ten servers are port 38-47 (counting from the bottom) on the production switch * Top ten s... [21:54:22] (03CR) 10Legoktm: "Hmm, up until Feb 20th, we were getting 0 hits in logs, since then we're getting about 45 hits a day." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/196782 (https://phabricator.wikimedia.org/T50491) (owner: 10Glaisher) [22:22:38] PROBLEM - DPKG on cp3037 is CRITICAL: Connection refused by host [22:22:38] PROBLEM - RAID on cp3041 is CRITICAL: Connection refused by host [22:22:38] PROBLEM - puppet last run on cp3042 is CRITICAL: Connection refused by host [22:22:38] PROBLEM - salt-minion processes on cp3039 is CRITICAL: Connection refused by host [22:22:38] PROBLEM - dhclient process on cp3045 is CRITICAL: Connection refused by host [22:22:39] PROBLEM - Disk space on cp3049 is CRITICAL: Connection refused by host [22:22:56] PROBLEM - Disk space on cp3037 is CRITICAL: Connection refused by host [22:22:56] PROBLEM - salt-minion processes on cp3042 is CRITICAL: Connection refused by host [22:22:56] PROBLEM - configured eth on cp3038 is CRITICAL: Connection refused by host [22:22:56] PROBLEM - DPKG on cp3040 is CRITICAL: Connection refused by host [22:22:56] PROBLEM - RAID on cp3044 is CRITICAL: Connection refused by host [22:22:57] PROBLEM - puppet last run on cp3045 is CRITICAL: Connection refused by host [22:23:06] PROBLEM - DPKG on cp3043 is CRITICAL: Connection refused by host [22:23:07] PROBLEM - Disk space on cp3040 is CRITICAL: Connection refused by host [22:23:07] PROBLEM - RAID on cp3049 is CRITICAL: Connection refused by host [22:23:07] PROBLEM - configured eth on cp3041 is CRITICAL: Connection refused by host [22:23:07] PROBLEM - salt-minion processes on cp3045 is CRITICAL: Connection refused by host [22:23:07] PROBLEM - dhclient process on cp3038 is CRITICAL: Connection refused by host [22:23:17] PROBLEM - DPKG on cp3047 is CRITICAL: Connection refused by host [22:23:17] PROBLEM - puppet last run on cp3038 is CRITICAL: Connection refused by host [22:23:17] PROBLEM - dhclient process on cp3041 is CRITICAL: Connection refused by host [22:23:17] PROBLEM - Disk space on cp3043 is CRITICAL: Connection refused by host [22:23:17] PROBLEM - RAID on cp3037 is CRITICAL: Connection refused by host [22:23:18] PROBLEM - configured eth on cp3044 is CRITICAL: Connection refused by host [22:23:27] PROBLEM - salt-minion processes on cp3038 is CRITICAL: Connection refused by host [22:23:27] PROBLEM - puppet last run on cp3041 is CRITICAL: Connection refused by host [22:23:27] PROBLEM - Disk space on cp3047 is CRITICAL: Connection refused by host [22:23:28] PROBLEM - dhclient process on cp3044 is CRITICAL: Connection refused by host [22:23:28] PROBLEM - RAID on cp3040 is CRITICAL: Connection refused by host [22:23:28] PROBLEM - configured eth on cp3049 is CRITICAL: Connection refused by host [22:23:37] PROBLEM - configured eth on cp3037 is CRITICAL: Connection refused by host [22:23:37] PROBLEM - puppet last run on cp3044 is CRITICAL: Connection refused by host [22:23:37] PROBLEM - DPKG on cp3039 is CRITICAL: Connection refused by host [22:23:37] PROBLEM - RAID on cp3043 is CRITICAL: Connection refused by host [22:23:37] PROBLEM - dhclient process on cp3049 is CRITICAL: Connection refused by host [22:23:38] PROBLEM - salt-minion processes on cp3041 is CRITICAL: Connection refused by host [22:23:47] PROBLEM - DPKG on cp3042 is CRITICAL: Connection refused by host [22:23:47] PROBLEM - dhclient process on cp3037 is CRITICAL: Connection refused by host [22:23:47] PROBLEM - Disk space on cp3039 is CRITICAL: Connection refused by host [22:23:48] PROBLEM - configured eth on cp3040 is CRITICAL: Connection refused by host [22:23:48] PROBLEM - salt-minion processes on cp3044 is CRITICAL: Connection 
refused by host [22:23:48] PROBLEM - RAID on cp3047 is CRITICAL: Connection refused by host [22:23:48] PROBLEM - puppet last run on cp3049 is CRITICAL: Connection refused by host [22:24:06] PROBLEM - DPKG on cp3045 is CRITICAL: Connection refused by host [22:24:07] PROBLEM - Disk space on cp3042 is CRITICAL: Connection refused by host [22:24:07] PROBLEM - dhclient process on cp3040 is CRITICAL: Connection refused by host [22:24:07] PROBLEM - puppet last run on cp3037 is CRITICAL: Connection refused by host [22:24:07] PROBLEM - salt-minion processes on cp3049 is CRITICAL: Connection refused by host [22:24:07] PROBLEM - configured eth on cp3043 is CRITICAL: Connection refused by host [22:24:16] PROBLEM - puppet last run on cp3040 is CRITICAL: Connection refused by host [22:24:17] PROBLEM - Disk space on cp3045 is CRITICAL: Connection refused by host [22:24:17] PROBLEM - dhclient process on cp3043 is CRITICAL: Connection refused by host [22:24:17] PROBLEM - RAID on cp3039 is CRITICAL: Connection refused by host [22:24:17] PROBLEM - configured eth on cp3047 is CRITICAL: Connection refused by host [22:24:17] PROBLEM - salt-minion processes on cp3037 is CRITICAL: Connection refused by host [22:24:27] PROBLEM - salt-minion processes on cp3040 is CRITICAL: Connection refused by host [22:24:27] PROBLEM - DPKG on cp3038 is CRITICAL: Connection refused by host [22:24:27] PROBLEM - RAID on cp3042 is CRITICAL: Connection refused by host [22:24:27] PROBLEM - puppet last run on cp3043 is CRITICAL: Connection refused by host [22:24:27] PROBLEM - dhclient process on cp3047 is CRITICAL: Connection refused by host [22:24:46] PROBLEM - configured eth on cp3039 is CRITICAL: Connection refused by host [22:24:46] PROBLEM - Disk space on cp3038 is CRITICAL: Connection refused by host [22:24:46] PROBLEM - RAID on cp3045 is CRITICAL: Connection refused by host [22:24:46] PROBLEM - DPKG on cp3041 is CRITICAL: Connection refused by host [22:24:46] PROBLEM - salt-minion processes on cp3043 is CRITICAL: Connection refused by host [22:24:47] PROBLEM - puppet last run on cp3047 is CRITICAL: Connection refused by host [22:24:47] PROBLEM - Disk space on cp3041 is CRITICAL: Connection refused by host [22:24:48] PROBLEM - dhclient process on cp3039 is CRITICAL: Connection refused by host [22:24:48] PROBLEM - salt-minion processes on cp3047 is CRITICAL: Connection refused by host [22:24:49] PROBLEM - DPKG on cp3044 is CRITICAL: Connection refused by host [22:24:49] PROBLEM - configured eth on cp3042 is CRITICAL: Connection refused by host [22:24:57] PROBLEM - RAID on cp3038 is CRITICAL: Connection refused by host [22:24:57] PROBLEM - dhclient process on cp3042 is CRITICAL: Connection refused by host [22:24:57] PROBLEM - puppet last run on cp3039 is CRITICAL: Connection refused by host [22:24:57] PROBLEM - Disk space on cp3044 is CRITICAL: Connection refused by host [22:24:58] PROBLEM - DPKG on cp3049 is CRITICAL: Connection refused by host [22:24:58] PROBLEM - configured eth on cp3045 is CRITICAL: Connection refused by host [22:29:40] ^ all of that is some new hardware being installed. monitoring happened to see it before it rightly should. nothing's broke for production. 
[22:32:07] RECOVERY - Disk space on cp3037 is OK: DISK OK [22:32:27] RECOVERY - salt-minion processes on cp3037 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:32:37] RECOVERY - RAID on cp3037 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [22:32:57] RECOVERY - configured eth on cp3037 is OK: NRPE: Unable to read output [22:33:06] RECOVERY - DPKG on cp3037 is OK: All packages OK [22:33:06] RECOVERY - dhclient process on cp3037 is OK: PROCS OK: 0 processes with command name dhclient [22:33:17] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [22:40:55] RECOVERY - RAID on cp3038 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [22:41:13] RECOVERY - RAID on cp3040 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [22:41:14] RECOVERY - configured eth on cp3038 is OK: NRPE: Unable to read output [22:41:14] RECOVERY - DPKG on cp3040 is OK: All packages OK [22:41:14] RECOVERY - Disk space on cp3038 is OK: DISK OK [22:41:14] RECOVERY - dhclient process on cp3040 is OK: PROCS OK: 0 processes with command name dhclient [22:41:33] RECOVERY - Disk space on cp3040 is OK: DISK OK [22:41:34] RECOVERY - dhclient process on cp3038 is OK: PROCS OK: 0 processes with command name dhclient [22:41:34] RECOVERY - RAID on cp3039 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [22:41:34] RECOVERY - salt-minion processes on cp3040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:41:43] RECOVERY - salt-minion processes on cp3038 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:41:53] RECOVERY - DPKG on cp3039 is OK: All packages OK [22:41:54] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [22:41:54] RECOVERY - DPKG on cp3038 is OK: All packages OK [22:41:54] RECOVERY - dhclient process on cp3039 is OK: PROCS OK: 0 processes with command name dhclient [22:42:03] RECOVERY - Disk space on cp3039 is OK: DISK OK [22:42:03] RECOVERY - salt-minion processes on cp3039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:42:03] RECOVERY - configured eth on cp3040 is OK: NRPE: Unable to read output [22:42:04] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [22:42:14] RECOVERY - configured eth on cp3039 is OK: NRPE: Unable to read output [22:42:33] RECOVERY - configured eth on cp3044 is OK: NRPE: Unable to read output [22:42:33] RECOVERY - configured eth on cp3049 is OK: NRPE: Unable to read output [22:42:43] RECOVERY - configured eth on cp3041 is OK: NRPE: Unable to read output [22:42:43] RECOVERY - salt-minion processes on cp3045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:42:43] RECOVERY - DPKG on cp3043 is OK: All packages OK [22:42:43] RECOVERY - RAID on cp3049 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [22:42:43] RECOVERY - Disk space on cp3045 is OK: DISK OK [22:42:44] RECOVERY - dhclient process on cp3043 is OK: PROCS OK: 0 processes with command name dhclient [22:42:44] RECOVERY - configured eth on cp3047 is OK: NRPE: Unable to read output [22:42:45] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [22:42:45] RECOVERY - DPKG on cp3047 is OK: All packages OK [22:42:46] RECOVERY - dhclient process on cp3044 is OK: PROCS OK: 0 processes 
with command name dhclient [22:42:46] RECOVERY - dhclient process on cp3049 is OK: PROCS OK: 0 processes with command name dhclient [22:42:54] RECOVERY - Disk space on cp3047 is OK: DISK OK [22:42:54] RECOVERY - DPKG on cp3041 is OK: All packages OK [22:42:54] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [22:43:03] RECOVERY - dhclient process on cp3041 is OK: PROCS OK: 0 processes with command name dhclient [22:43:03] RECOVERY - Disk space on cp3043 is OK: DISK OK [22:43:03] RECOVERY - dhclient process on cp3047 is OK: PROCS OK: 0 processes with command name dhclient [22:43:03] RECOVERY - RAID on cp3042 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [22:43:03] RECOVERY - Disk space on cp3041 is OK: DISK OK [22:43:04] RECOVERY - DPKG on cp3044 is OK: All packages OK [22:43:04] RECOVERY - salt-minion processes on cp3047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:43:05] RECOVERY - configured eth on cp3042 is OK: NRPE: Unable to read output [22:43:05] RECOVERY - DPKG on cp3042 is OK: All packages OK [22:43:06] RECOVERY - RAID on cp3047 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [22:43:06] RECOVERY - RAID on cp3041 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [22:43:07] RECOVERY - salt-minion processes on cp3044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:43:07] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [22:43:13] RECOVERY - dhclient process on cp3045 is OK: PROCS OK: 0 processes with command name dhclient [22:43:13] RECOVERY - Disk space on cp3049 is OK: DISK OK [22:43:13] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [22:43:13] RECOVERY - Disk space on cp3044 is OK: DISK OK [22:43:13] RECOVERY - dhclient process on cp3042 is OK: PROCS OK: 0 processes with command name dhclient [22:43:14] RECOVERY - DPKG on cp3049 is OK: All packages OK [22:43:14] RECOVERY - configured eth on cp3045 is OK: NRPE: Unable to read output [22:43:15] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [22:43:15] RECOVERY - RAID on cp3045 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [22:43:33] RECOVERY - salt-minion processes on cp3042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:43:34] RECOVERY - RAID on cp3044 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [22:43:34] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [22:43:34] RECOVERY - salt-minion processes on cp3041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:43:34] RECOVERY - configured eth on cp3043 is OK: NRPE: Unable to read output [22:43:34] RECOVERY - Disk space on cp3042 is OK: DISK OK [22:43:34] RECOVERY - salt-minion processes on cp3049 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:43:35] RECOVERY - DPKG on cp3045 is OK: All packages OK [22:51:39] hmm, db error on meta? [22:53:02] site is down [22:53:13] hmm... that's not a good bug... the action I was trying to do "happened" but it didn't get logged... 
[22:53:13] 503 Service Temporarily Unavailable [22:53:21] because of the db error [22:53:42] bblack: site is down [22:53:48] every time someone says "site is down", I look and it's not. be more specific! [22:54:05] bblack: 503 Service Temporarily Unavailable [22:54:13] (Cannot access the database: Can't connect to MySQL server on '10.64.16.22' (4) (10.64.16.22)) [22:54:17] now back [22:54:24] and off again [22:54:28] so probably only for uncached pages [22:54:35] or for logged-in users [22:54:43] 6operations, 10ops-esams: Rack, cable, prepare cp3030-3049 - https://phabricator.wikimedia.org/T92514#1119336 (10faidon) All servers are now installed, running puppet etc. now. iDRAC has also been configured with the right IPs and with a non-default password. No BIOS or other iDRAC settings set yet. asw-esams... [22:54:51] I'm logged in and still have the site, but got that error when I tried to lock an account [22:55:04] (the lock actually happened, but nothing appears to have been logged) [22:55:14] I'm logged in now and haven't hit a failure yet clicking "random article" links [22:55:26] i'm logged in to bblack [22:55:29] oh now I'm failing [22:55:38] Getting the chrome Aw, Snap! error [22:55:43] on all meta pages [22:56:02] for me on he.wiki [22:56:52] I'm ok on random pages on he.wiki too [22:56:55] now getting pages again, so clearly intermittent at some level [22:57:08] * jamesofur is at the office for the record (and therefore connected directly to ULSFO) [22:57:44] hmm [22:58:46] I don't see anything crazy on db1033 yet [22:59:05] huh... my traceroute to enWiki is odd [23:00:01] (core1.corp as expected, then 10.149.0.1 then just time outs) [23:00:05] (db1033 being 10.64.16.22 from mysql error above - looks pretty stable on status, nothing odd in syslog/dmesg, etc) [23:00:16] jamesofur: that's just weird office network shit, ignore that [23:00:21] fair enough [23:00:37] so, lots of dberrors for 2 days now [23:01:32] all over the place [23:01:55] bblack: https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=MySQL%20eqiad&h=db1033.eqiad.wmnet&r=hour&z=small&jr=&js=&st=1426374059&v=0&m=mysql_aborted_clients&vl=clients&ti=mysql_aborted_clients&z=large ? [23:02:02] springle: ping [23:04:07] it's a max_connections issue [23:04:15] | max_connections | 2500 | [23:04:22] | Max_used_connections | 2501 | [23:04:29] it's hitting the connection limit [23:05:23] (that's db1033 global vars stuff) [23:05:46] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&c=MySQL+eqiad&h=&tab=m&vn=&hide-hf=false&m=mysql_connections&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [23:05:55] that's not it [23:06:02] max_used_connections is the max ever used [23:06:02] no? [23:06:09] not the current value [23:06:26] yeah and max_connections is the max allowed. so at one point in its life, it used every available connection. [23:06:41] sure [23:06:46] now it's 38 though :P [23:07:10] because clients are lame and stopped trying hard? or really? 
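The exchange above turns on the difference between the current connection count and the server's high-water mark. A sketch of how to see both at once on db1033 (the address is taken from the 503 error; credentials with access to the status variables are assumed): Threads_connected is what is open right now, Max_used_connections is the most ever used since startup, and Aborted_connects is the counter climbing in the ganglia graphs.
```
mysql -h 10.64.16.22 -e "
  SHOW GLOBAL VARIABLES LIKE 'max_connections';
  SHOW GLOBAL STATUS WHERE Variable_name IN
    ('Threads_connected', 'Max_used_connections', 'Aborted_connects');"
```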
[23:07:19] I guess ganglia graph would know though [23:08:00] https://tendril.wikimedia.org/host/view/db1033.eqiad.wmnet/3306 [23:08:11] but I'm looking at dberror.log [23:08:16] it doesn't look specific to db1033 at all [23:09:24] heh none of those show "number of active connections" [23:10:25] hmm [23:10:36] PROBLEM - puppet last run on mw1249 is CRITICAL: CRITICAL: Puppet has 1 failures [23:11:13] the only thing i see in gerrit that might be relevant is https://gerrit.wikimedia.org/r/194788 but it was merged over a week ago [23:13:08] oh ganglia has it: [23:13:09] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=MySQL%20eqiad&h=db1033.eqiad.wmnet&r=4hr&z=default&jr=&js=&st=1426373969&v=44&m=mysql_threads_connected&vl=threads&ti=mysql_threads_connected&z=large [23:13:25] so yeah, not maxed out on conns presently, just some time in the distant past [23:13:52] https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=MySQL%20eqiad&h=db1021.eqiad.wmnet&r=week&z=default&jr=&js=&st=1426374807&v=10&m=mysql_aborted_connects&vl=conns&ti=mysql_aborted_connects&z=large [23:14:04] consistent with dberror.log [23:14:19] yup [23:14:21] and recent, too [23:14:34] yeah, dberror.log shows a large spike from the 12th onwards [23:14:35] so, what's causing that? [23:14:50] paravoid: nothing before ? [23:14:58] faidon@fluorine:/a/mw-log$ for i in 12 13 14; do zcat archive/dberror.log-201503$i.gz |wc -l; done [23:15:01] 39380 [23:15:04] 278941 [23:15:06] 442346 [23:17:23] any exact timestamp on the 12th ? [23:17:25] which is consistent with: https://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=MySQL+eqiad&h=db1021.eqiad.wmnet&jr=&js=&v=10&m=mysql_aborted_connects&vl=conns&ti=mysql_aborted_connects [23:17:35] yes [23:19:04] Mar 14 23:18:52 mw1117: #012Fatal error: request has exceeded memory limit in /srv/mediawiki/php-1.25wmf20/includes/db/DatabaseMysqli.php on line 183 [23:19:08] circa 19:30 -> 20:30 -ish (UTC on Mar 12) [23:19:14] is when the graph starting taking off with errors [23:21:13] candidate: https://gerrit.wikimedia.org/r/#/c/195853/ [23:21:48] doubt it :) [23:22:18] yes, cx us not in wide use yet [23:22:22] *is not [23:25:15] one more unlickly one: https://gerrit.wikimedia.org/r/#/c/195192/ [23:25:32] 6operations, 7HTTPS, 3HTTPS-by-default: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#1119370 (10konklone) If it's helpful, we got a pretty good perspective from the authors of the HPKP spec on how to think about pinning, on this GitHub thread: https://... [23:26:26] the comments on that merge say: (Disclaimer Yuri has told me I can merge this since it is beta labs only and not on production yet) [23:26:30] (on Mar 12) [23:27:18] just click under "included in" [23:28:16] RECOVERY - puppet last run on mw1249 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [23:29:24] 19:20 logmsgbot: rush Synchronized wmf-config/session.php: re-reenable mc1014 (duration: 00m 06s) [23:29:27] huh [23:30:11] yes, that was that 5min site outage we had, remember? [23:30:15] that was the post outage [23:30:16] no [23:30:20] that's the re-reenable [23:30:30] I'm just saying, related. 
[23:30:30] rush committed twice [23:31:08] one which broke the site, the other that re-enabled mc1014 [23:31:57] This commit: https://gerrit.wikimedia.org/r/#/c/196281/ [23:32:37] maybe-not-related, but looking at a different minor performance issue yesterday, ori was musing about the possibility that we have some memcached that are overfull and evicting objects earlier than expected for bits modules, and/or that we have some kind of memcached split-brain going on somewhere. [23:33:42] (basically, bits js modules are apparently having their timestamps randomly regenerated when they shouldn't be expected to, which is causing small recurrent wavy patterns of traffic as they refresh in bits varnish) [23:37:14] so, the mc1014 work was apparently to move it from one row to another, and change its IP/vlan in the process [23:37:33] any chance something that connects to it only resolves its hostname once at startup and is still hanging onto the old IP and failing connections? [23:40:22] the thing is [23:40:27] I don't see increased query traffic anywhere [23:40:55] yeah the memcached stats look sane compared to its neighbors [23:41:00] no I mean mysql [23:41:02] (other than fewer bytes stored, which is slowly ramping in still) [23:41:15] oh I'm still looking at mc1014 due to the timing coincidence [23:41:46] just connecting problems
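bblack's stale-IP hypothesis above, that something resolved mc1014 once at startup and is still dialling the old address after the row move, could be checked from one of the app servers. A sketch only: it assumes memcached listens on its default port 11211 and that the host's full name is mc1014.eqiad.wmnet.
```
# What DNS hands out now, after the move:
host mc1014.eqiad.wmnet

# Which memcached peer addresses this client actually has connections to:
ss -tn | awk '$5 ~ /:11211$/ {print $5}' | sort | uniq -c
```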