[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150709T0000). [00:00:26] (03PS7) 10Dzahn: Change php5 dependency to libapache2-mod-php5 [puppet] - 10https://gerrit.wikimedia.org/r/223701 (owner: 10Negative24) [00:01:28] (03CR) 10Dzahn: [C: 032] Change php5 dependency to libapache2-mod-php5 [puppet] - 10https://gerrit.wikimedia.org/r/223701 (owner: 10Negative24) [00:02:10] (03CR) 10Dzahn: "this is on phabricator only !" [puppet] - 10https://gerrit.wikimedia.org/r/223701 (owner: 10Negative24) [00:06:00] twentyafterfour: ^ puppet runs fixed on phabricator.. it is applying a couple things that had been waiting due to that [00:07:18] RECOVERY - puppet last run on iridium is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [00:08:10] 6operations, 6Phabricator, 5Patch-For-Review: iridium (phab server) - Could not find dependency Package[php5] - https://phabricator.wikimedia.org/T105210#1439827 (10Dzahn) 5Open>3Resolved a:3Dzahn https://gerrit.wikimedia.org/r/#/c/223701/7/modules/phabricator/manifests/init.pp --- Info: Applying con... [00:09:24] 6operations, 6Phabricator: iridium (phab server) - Could not find dependency Package[php5] - https://phabricator.wikimedia.org/T105210#1439830 (10Dzahn) [00:10:04] (03PS3) 10Dzahn: ganglia: add aggregator for ulsfo on bast4001 [puppet] - 10https://gerrit.wikimedia.org/r/223231 (https://phabricator.wikimedia.org/T93776) [00:17:51] (03CR) 10Dzahn: [C: 032] ganglia: add aggregator for ulsfo on bast4001 [puppet] - 10https://gerrit.wikimedia.org/r/223231 (https://phabricator.wikimedia.org/T93776) (owner: 10Dzahn) [00:20:29] !log Ran populateContentModel.php --table=revision for odd-numbered namespaces on officewiki for T105245 [00:20:34] Logged the message, Mr. Obvious [00:24:19] PROBLEM - puppet last run on bast4001 is CRITICAL puppet fail [00:24:30] yes, that's me, on it [00:24:41] it tries to use old and new class at the same time [00:24:51] mutante: so I thought my patch was broken when it got V: -1'd but it was just 2sp whitespace :D [00:25:08] Negative24: yes, tabs [00:25:28] (I wonder why vim-fugitive didn't catch it...) [00:25:32] it took a while to remove them all:) [00:25:37] heh [00:25:44] but they weren't all from me [00:25:52] hehe, definitely not [00:28:03] i mean before we even had the jenkins check :) [00:28:11] (03PS1) 10Dzahn: bast4001: use ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/223703 [00:28:33] (03PS2) 10Dzahn: bast4001: use ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/223703 [00:29:24] (03CR) 10Dzahn: [C: 032] bast4001: use ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/223703 (owner: 10Dzahn) [00:30:59] the old error is gone.. 
new error is there [00:34:59] (03PS1) 10Dzahn: ganglia_new: add aggregator config for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/223705 (https://phabricator.wikimedia.org/T93776) [00:35:19] (03PS2) 10Dzahn: ganglia_new: add aggregator config for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/223705 (https://phabricator.wikimedia.org/T93776) [00:36:24] (03PS3) 10Dzahn: ganglia_new: add aggregator config for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/223705 (https://phabricator.wikimedia.org/T93776) [00:36:49] (03PS4) 10Dzahn: ganglia_new: add aggregator config for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/223705 (https://phabricator.wikimedia.org/T93776) [00:37:50] (03CR) 10Dzahn: [C: 032] ganglia_new: add aggregator config for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/223705 (https://phabricator.wikimedia.org/T93776) (owner: 10Dzahn) [00:41:28] yay, you can recover now [00:41:40] RECOVERY - puppet last run on bast4001 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [00:42:23] 6operations, 7Monitoring, 5Patch-For-Review: remove ganglia(old), replace with ganglia_new - https://phabricator.wikimedia.org/T93776#1439901 (10Dzahn) ulsfo now has an aggregator [00:45:04] !log Running populateContentModel.php --wiki=cawiki --table=revision --ns=5 [00:45:09] Logged the message, Mr. Obvious [00:52:45] (03PS1) 10Dzahn: ulsfo mobile caches: switch to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/223706 [00:55:13] (03PS2) 10Dzahn: ulsfo mobile caches: switch to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/223706 [00:56:38] (03CR) 10Dzahn: [C: 032] ulsfo mobile caches: switch to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/223706 (owner: 10Dzahn) [00:56:59] 6operations, 7Monitoring, 5Patch-For-Review: remove ganglia(old), replace with ganglia_new - https://phabricator.wikimedia.org/T93776#1439914 (10Gage) logstash1001-1003: These hosts are older than 1004-1006, and run Precise instead of Jessie. Gmond wouldn't stop or start. ``` gage@logstash1002:~$ sudo /usr/s... 
[01:03:08] PROBLEM - Disk space on uranium is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=78%) [01:06:48] RECOVERY - Disk space on uranium is OK: DISK OK [01:07:46] !log uranium - deleted apache logs older than 90 days [01:07:50] Logged the message, Master [01:28:24] (03PS1) 10Dzahn: Revert "ulsfo mobile caches: switch to ganglia_new" [puppet] - 10https://gerrit.wikimedia.org/r/223709 [01:30:31] (03CR) 10Dzahn: "no luck yet, things look alright on bast4001, it has all the aggregators configured, also on the monitored hosts thing ok, just on ganglia" [puppet] - 10https://gerrit.wikimedia.org/r/223709 (owner: 10Dzahn) [01:31:27] (03CR) 10Dzahn: [C: 032] Revert "ulsfo mobile caches: switch to ganglia_new" [puppet] - 10https://gerrit.wikimedia.org/r/223709 (owner: 10Dzahn) [01:35:50] (03PS1) 10Springle: depool db1037 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223710 [01:36:20] (03CR) 10Dzahn: "it resulted in uranium setting it as:" [puppet] - 10https://gerrit.wikimedia.org/r/223709 (owner: 10Dzahn) [01:36:22] (03CR) 10Springle: [C: 032] depool db1037 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223710 (owner: 10Springle) [01:36:29] (03Merged) 10jenkins-bot: depool db1037 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223710 (owner: 10Springle) [01:37:46] !log springle Synchronized wmf-config/db-eqiad.php: depool db1037 (duration: 00m 11s) [01:37:51] Logged the message, Master [01:38:05] (03PS2) 10Dzahn: few more lint fixes in role classes [puppet] - 10https://gerrit.wikimedia.org/r/222536 [01:39:04] (03CR) 10Dzahn: [C: 032] few more lint fixes in role classes [puppet] - 10https://gerrit.wikimedia.org/r/222536 (owner: 10Dzahn) [01:41:44] (03PS4) 10Dzahn: releases::reprepro: move class into autoload layout [puppet] - 10https://gerrit.wikimedia.org/r/223450 [01:42:36] (03CR) 10Dzahn: [C: 032] releases::reprepro: move class into autoload layout [puppet] - 10https://gerrit.wikimedia.org/r/223450 (owner: 10Dzahn) [01:46:20] (03PS5) 1020after4: [WIP] Phabricator: Create differential puppet role [puppet] - 10https://gerrit.wikimedia.org/r/222987 (https://phabricator.wikimedia.org/T104827) (owner: 10Negative24) [01:46:45] (03PS3) 10Dzahn: phabricator: insert   for some footer strings [puppet] - 10https://gerrit.wikimedia.org/r/223604 [01:47:40] (03CR) 10Dzahn: [C: 032] phabricator: insert   for some footer strings [puppet] - 10https://gerrit.wikimedia.org/r/223604 (owner: 10Dzahn) [01:49:08] !log switched remaining cassandra nodes to JDK8 [01:49:13] Logged the message, Master [01:50:10] (03Abandoned) 10Dzahn: lvs: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211343 (owner: 10Dzahn) [01:52:20] 6operations, 10RESTBase: Test JDK8 with Cassandra - https://phabricator.wikimedia.org/T104888#1439942 (10GWicke) I have now switched the remaining three nodes to JDK8 in order to see if this reduces timeouts further. [01:52:22] (03Abandoned) 10Dzahn: base: Don't use 'ndots: 2' in labs resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/196731 (https://phabricator.wikimedia.org/T92351) (owner: 10Dzahn) [01:52:24] (03CR) 1020after4: Add Phragile module. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/218930 (https://phabricator.wikimedia.org/T101235) (owner: 10Jakob) [01:53:03] twentyafterfour: that footer thing failed :) [01:53:16] nice [01:53:38] mutante: I want to kill off the whole local settings yaml thing [01:53:55] (03PS1) 10Dzahn: Revert "phabricator: insert   for some footer strings" [puppet] - 10https://gerrit.wikimedia.org/r/223711 [01:54:14] (03PS3) 1020after4: Bump phabricator tag to release/2015-07-08/1 [puppet] - 10https://gerrit.wikimedia.org/r/223697 [01:55:02] >Wikimedia&nbsp;Foundation [01:55:10] twentyafterfour: *nod& [01:55:23] https://gerrit.wikimedia.org/r/#/c/205797/ [01:55:29] (03PS2) 10Dzahn: Revert "phabricator: insert   for some footer strings" [puppet] - 10https://gerrit.wikimedia.org/r/223711 [01:56:09] (03CR) 10Dzahn: [C: 032] Revert "phabricator: insert   for some footer strings" [puppet] - 10https://gerrit.wikimedia.org/r/223711 (owner: 10Dzahn) [01:57:05] We really shouldn't be transforming yaml into json the way we do it [01:57:26] my abandoned patch above would have fixed it but I couldn't get +2 on it [01:58:05] (03CR) 10Dzahn: "@Josve05a unfortunately didn't work - lead to source like "Wikimedia&nbsp;Foundation"" [puppet] - 10https://gerrit.wikimedia.org/r/223604 (owner: 10Dzahn) [01:59:09] 10Ops-Access-Requests, 6operations, 6Services, 7Icinga, 7Monitoring: give services team permissions to send commands in icinga - https://phabricator.wikimedia.org/T105228#1439945 (10Dzahn) a:5Dzahn>3None [01:59:49] 6operations, 7Database: move tendril to gerrit repo and puppetize cloning - https://phabricator.wikimedia.org/T98816#1439946 (10Dzahn) git::clone has "ensure => latest", we should use that [02:00:50] !log pkg upgrade and restart db1037 [02:00:54] 6operations, 7Icinga, 5Patch-For-Review: monitor HTTP on bromine.eqiad.wmnet - https://phabricator.wikimedia.org/T104948#1439947 (10Dzahn) >>! In T104948#1437383, @Matanya wrote: > do we need to monitor http or https is enough? only http when talking to bromine since https ends at misc-web / nginx [02:00:55] Logged the message, Master [02:02:10] (03CR) 10Dzahn: "when adding this to the backend, host bromine, we technically just speak http. https would be monitoring misc-web (which could just be an " [puppet] - 10https://gerrit.wikimedia.org/r/223364 (https://phabricator.wikimedia.org/T104948) (owner: 10John F. Lewis) [02:03:31] (03Restored) 1020after4: Move maniphest status settings into custom/wmf-defaults.php [puppet] - 10https://gerrit.wikimedia.org/r/205797 (https://phabricator.wikimedia.org/T548) (owner: 1020after4) [02:07:47] (03PS4) 1020after4: Move maniphest status settings into custom/wmf-defaults.php [puppet] - 10https://gerrit.wikimedia.org/r/205797 (https://phabricator.wikimedia.org/T548) [02:08:46] (03PS4) 1020after4: Bump phabricator tag to release/2015-07-08/1 [puppet] - 10https://gerrit.wikimedia.org/r/223697 [02:09:07] mutante: can I get +2 on https://gerrit.wikimedia.org/r/#/c/223697/ ? [02:09:25] and maybe even https://gerrit.wikimedia.org/r/#/c/205797/4 ? [02:09:32] (03PS5) 1020after4: Move maniphest status settings into custom/wmf-defaults.php [puppet] - 10https://gerrit.wikimedia.org/r/205797 (https://phabricator.wikimedia.org/T548) [02:11:02] (03CR) 10Dzahn: [C: 032] Bump phabricator tag to release/2015-07-08/1 [puppet] - 10https://gerrit.wikimedia.org/r/223697 (owner: 1020after4) [02:12:41] twentyafterfour: release tag yes, the other one, not right now please [02:14:17] mutante: just too much to look at? 
it's fine, it's sat forever it can sit some more :) [02:14:56] thanks for the +2 though, I guess I'm gonna upgrade phabricator a bit late tonight [02:16:52] !log l10nupdate Synchronized php-1.26wmf12/cache/l10n: (no message) (duration: 00m 47s) [02:16:58] Logged the message, Master [02:17:02] !log LocalisationUpdate completed (1.26wmf12) at 2015-07-09 02:17:02+00:00 [02:17:07] Logged the message, Master [02:17:43] 10Ops-Access-Requests, 6operations, 7Icinga: give John Lewis permissions to send commands in icinga - https://phabricator.wikimedia.org/T105229#1439956 (10Dzahn) p:5Triage>3Normal [02:18:27] 6operations, 7Graphite, 7HHVM, 7Monitoring: check_graphite - "UNKNOWN: More than half of the datapoints are undefined " - https://phabricator.wikimedia.org/T105218#1439957 (10Dzahn) p:5Triage>3Normal [02:19:16] twentyafterfour: yea, a bit late , gotta leave [02:19:19] cya around [02:23:51] !log l10nupdate Synchronized php-1.26wmf13/cache/l10n: (no message) (duration: 00m 34s) [02:23:56] Logged the message, Master [02:24:00] 6operations: move grafana from zirconium to a VM - https://phabricator.wikimedia.org/T105008#1439961 (10Dzahn) [02:24:01] !log LocalisationUpdate completed (1.26wmf13) at 2015-07-09 02:24:00+00:00 [02:24:05] Logged the message, Master [02:28:21] !log moved phd log to free disk space on iridium [02:28:26] Logged the message, Master [02:28:26] !log restarted phd [02:28:30] Logged the message, Master [02:36:36] !log l10nupdate Synchronized php-1.26wmf12/cache/l10n: (no message) (duration: 10m 32s) [02:36:42] Logged the message, Master [02:40:16] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jul 9 02:40:16 UTC 2015 (duration 40m 15s) [02:40:21] Logged the message, Master [02:42:56] !log LocalisationUpdate completed (1.26wmf12) at 2015-07-09 02:42:56+00:00 [02:43:01] Logged the message, Master [02:58:31] !log l10nupdate Synchronized php-1.26wmf13/cache/l10n: (no message) (duration: 05m 29s) [02:58:37] Logged the message, Master [02:58:47] (03PS1) 10Alex Monk: Block WMF account creation by users who aren't already WMF tagged [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223715 [02:58:51] mutante, ^ [02:59:57] (03CR) 10Alex Monk: "This is untested, but I just hacked it together because I got fed up of the subject of verification on #WMF-NDA-Requests tickets" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223715 (owner: 10Alex Monk) [03:00:08] PROBLEM - puppet last run on elastic1020 is CRITICAL Puppet has 1 failures [03:01:13] !log LocalisationUpdate completed (1.26wmf13) at 2015-07-09 03:01:13+00:00 [03:01:18] Logged the message, Master [03:02:28] (03CR) 10Alex Monk: "A similar restriction was in the global title blacklist briefly back in December 2011, but it got reverted because it broke CA autocreatio" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223715 (owner: 10Alex Monk) [03:05:25] (03CR) 10Alex Monk: "Ah, looks like it went back on after that, and then got reverted again: https://meta.wikimedia.org/w/index.php?title=Title_blacklist&diff=" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223715 (owner: 10Alex Monk) [03:16:49] RECOVERY - puppet last run on elastic1020 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [03:21:40] (03CR) 10Alex Monk: [C: 04-2] "There might be a better way than this hack." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223715 (owner: 10Alex Monk) [03:43:30] (03CR) 10MZMcBride: "This seems to be related to ." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/223715 (owner: 10Alex Monk) [03:46:09] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [04:02:58] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [04:10:09] PROBLEM - puppet last run on cp4003 is CRITICAL puppet fail [04:26:58] RECOVERY - puppet last run on cp4003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:48:57] (03PS1) 10GWicke: Reduce the compaction throughput limit to 80mb/s [puppet] - 10https://gerrit.wikimedia.org/r/223722 [04:53:38] (03Abandoned) 10GWicke: Reduce compaction throughput to 110mb/s [puppet] - 10https://gerrit.wikimedia.org/r/223499 (owner: 10GWicke) [04:56:04] (03CR) 10GWicke: [C: 04-1] "I'm actually no longer sure that this still makes sense with the lower 80mb/s throughput limit (https://gerrit.wikimedia.org/r/#/c/223722/" [puppet] - 10https://gerrit.wikimedia.org/r/223457 (owner: 10GWicke) [05:23:09] !log dynamically limited cassandra compaction throughput to 80mb/s; please review https://gerrit.wikimedia.org/r/#/c/223722/ to make this permanent [05:23:13] Logged the message, Master [05:46:37] There's reports of a redirect loop on some browsers (Chrome) for the page https://commons.wikimedia.org/wiki/User_talk:Budsus~idwiki [05:46:49] possibly to do with converting ~ to %7E [05:52:08] 6operations, 6Commons: https://commons.wikimedia.org/wiki/User_talk:Budsus~idwiki is a 301 redirect to the page itself - https://phabricator.wikimedia.org/T105265#1440038 (10Bawolff) [05:52:11] 6operations, 6Commons: https://commons.wikimedia.org/wiki/User_talk:Budsus~idwiki is a 301 redirect to the page itself - https://phabricator.wikimedia.org/T105265#1440041 (10zhuyifei1999) Note: https://commons.wikimedia.org/wiki/?curid=41447650 & https://commons.wikimedia.org/wiki/User_talk:Budsus~idwiki?redir... [06:05:30] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jul 9 06:05:30 UTC 2015 (duration 5m 29s) [06:05:35] Logged the message, Master [06:15:27] !log db1037 is repartitioning tables; it will lag intermittently for a day [06:15:31] Logged the message, Master [06:16:57] <_joe_> hey sean [06:17:16] evening _joe_ [06:17:34] <_joe_> long time no see :) [06:18:40] :) [06:22:39] vagrant ssh [06:22:42] uh [06:24:10] PROBLEM - puppet last run on mw1191 is CRITICAL Puppet has 1 failures [06:27:22] !log restarted apache2 on iridium to fix phab exception [06:27:25] Logged the message, Master [06:32:29] PROBLEM - puppet last run on cp2014 is CRITICAL Puppet has 1 failures [06:32:58] PROBLEM - puppet last run on cp2001 is CRITICAL Puppet has 1 failures [06:33:04] <_joe_> twentyafterfour: phab exception? 
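The redirect loop reported above for https://commons.wikimedia.org/wiki/User_talk:Budsus~idwiki (T105265) is suspected to come from layers disagreeing about whether '~' should be percent-encoded as %7E. Below is a minimal sketch of that failure mode under that assumption; both functions are hypothetical stand-ins, not the actual MediaWiki or Varnish code.

```python
# Minimal sketch of the suspected '~' vs '%7E' disagreement behind T105265.
# Both functions are hypothetical stand-ins, not the actual MediaWiki/Varnish code.

def server_canonicalize(path: str) -> str:
    """One layer 301s to a form with '~' percent-encoded."""
    return path.replace("~", "%7E")

def client_normalize(path: str) -> str:
    """Another layer decodes the unreserved '%7E' back to '~' (RFC 3986)."""
    return path.replace("%7E", "~")

path = "/wiki/User_talk:Budsus~idwiki"
for hop in range(3):
    target = server_canonicalize(path)
    print(f"hop {hop}: GET {path} -> 301 {target}")
    path = client_normalize(target)
# Every 301 points at a URL the client rewrites back to the original request,
# so the browser never converges and reports a redirect loop.
```

The GlobalFunctions.php and mediawiki.util.js syncs later in this log (11:21, also tagged T105265) are the corresponding fix to how such titles are encoded.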
[06:33:29] PROBLEM - puppet last run on db1023 is CRITICAL Puppet has 2 failures [06:33:49] PROBLEM - puppet last run on db2065 is CRITICAL Puppet has 1 failures [06:34:18] PROBLEM - puppet last run on db1028 is CRITICAL Puppet has 1 failures [06:34:35] _joe_: https://phabricator.wikimedia.org/T105266 [06:34:59] PROBLEM - puppet last run on cp4004 is CRITICAL Puppet has 2 failures [06:36:18] PROBLEM - puppet last run on mw2113 is CRITICAL Puppet has 1 failures [06:36:19] PROBLEM - puppet last run on mw1042 is CRITICAL Puppet has 1 failures [06:36:19] PROBLEM - puppet last run on mw1052 is CRITICAL Puppet has 1 failures [06:36:39] PROBLEM - puppet last run on mw2123 is CRITICAL Puppet has 1 failures [06:36:48] PROBLEM - puppet last run on mw2016 is CRITICAL Puppet has 1 failures [06:37:39] PROBLEM - puppet last run on mw2212 is CRITICAL Puppet has 1 failures [06:37:39] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures [06:38:38] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 1 failures [06:38:39] PROBLEM - puppet last run on mw2017 is CRITICAL Puppet has 1 failures [06:39:05] 10Ops-Access-Requests, 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: Need deploy rights for Wikidata Query Service - https://phabricator.wikimedia.org/T105185#1440127 (10Joe) To clarify a bit, what will be needed is: - creating a wdqs-admins group in the admin module - grant access to t... [06:39:59] PROBLEM - puppet last run on mw2056 is CRITICAL Puppet has 1 failures [06:40:39] (03PS4) 10Giuseppe Lavagetto: pybal: refactor pybal::pool, print pools in pybal::web [puppet] - 10https://gerrit.wikimedia.org/r/223523 [06:42:49] RECOVERY - puppet last run on mw1191 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:43:57] (03PS5) 10Giuseppe Lavagetto: pybal: refactor pybal::pool, print pools in pybal::web [puppet] - 10https://gerrit.wikimedia.org/r/223523 [06:46:09] RECOVERY - puppet last run on mw2016 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:46:32] (03PS2) 10GWicke: Reduce the compaction throughput limit to 65mb/s [puppet] - 10https://gerrit.wikimedia.org/r/223722 [06:46:59] RECOVERY - puppet last run on db2065 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:47:00] (03CR) 10Springle: [C: 032 V: 032] s7 pager slave partitioning [software] - 10https://gerrit.wikimedia.org/r/222538 (owner: 10Springle) [06:47:08] RECOVERY - puppet last run on mw2212 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:47:09] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:47:27] (03PS1) 10Springle: s6 pager slave partitioning [software] - 10https://gerrit.wikimedia.org/r/223732 [06:47:28] RECOVERY - puppet last run on db1028 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:47:29] RECOVERY - puppet last run on cp2014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:29] RECOVERY - puppet last run on mw2113 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [06:47:38] RECOVERY - puppet last run on mw1042 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:47:39] RECOVERY - puppet last run on mw1052 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:59] RECOVERY - puppet last run on cp2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:00] 
RECOVERY - puppet last run on mw2123 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:48:00] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:00] RECOVERY - puppet last run on mw2017 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:48:09] RECOVERY - puppet last run on cp4004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:29] RECOVERY - puppet last run on db1023 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:28] RECOVERY - puppet last run on mw2056 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:54:12] (03PS3) 10GWicke: Reduce the compaction throughput limit to 60mb/s [puppet] - 10https://gerrit.wikimedia.org/r/223722 [06:59:25] (03CR) 10Giuseppe Lavagetto: [C: 032] pybal: refactor pybal::pool, print pools in pybal::web [puppet] - 10https://gerrit.wikimedia.org/r/223523 (owner: 10Giuseppe Lavagetto) [07:04:04] <_joe_> grr what is wrong with this? [07:05:56] (03PS7) 10Merlijn van Deen: Add url to adminlogbot output [debs/adminbot] - 10https://gerrit.wikimedia.org/r/180890 [07:05:58] (03PS1) 10Merlijn van Deen: Update debian/changelog and debian/control [debs/adminbot] - 10https://gerrit.wikimedia.org/r/223735 [07:06:57] (03CR) 10Merlijn van Deen: "Rebased, added Ib442d9d386319937b0127063560265b071fb4113 for the debian changes (as I also added a slightly longer description)" [debs/adminbot] - 10https://gerrit.wikimedia.org/r/180890 (owner: 10Merlijn van Deen) [07:08:19] PROBLEM - puppet last run on palladium is CRITICAL puppet fail [07:08:24] (03PS1) 10Giuseppe Lavagetto: pybal: fix dependencies [puppet] - 10https://gerrit.wikimedia.org/r/223736 [07:09:05] (03CR) 10jenkins-bot: [V: 04-1] pybal: fix dependencies [puppet] - 10https://gerrit.wikimedia.org/r/223736 (owner: 10Giuseppe Lavagetto) [07:10:11] (03PS2) 10Giuseppe Lavagetto: pybal: fix dependencies [puppet] - 10https://gerrit.wikimedia.org/r/223736 [07:10:53] (03CR) 10jenkins-bot: [V: 04-1] pybal: fix dependencies [puppet] - 10https://gerrit.wikimedia.org/r/223736 (owner: 10Giuseppe Lavagetto) [07:13:32] (03PS3) 10Giuseppe Lavagetto: pybal: fix dependencies [puppet] - 10https://gerrit.wikimedia.org/r/223736 [07:13:40] <_joe_> I'm on a roll, I correct one error and I introduce a new one... [07:14:06] (03CR) 10Josve05a: "Pehaps we could use non-brekable spaces directly and not use html-codes for it? 
(" " instead of  )" [puppet] - 10https://gerrit.wikimedia.org/r/223604 (owner: 10Dzahn) [07:16:46] (03CR) 10Giuseppe Lavagetto: [C: 032] pybal: fix dependencies [puppet] - 10https://gerrit.wikimedia.org/r/223736 (owner: 10Giuseppe Lavagetto) [07:16:49] PROBLEM - puppet last run on ganeti2002 is CRITICAL puppet fail [07:25:03] RECOVERY - puppet last run on palladium is OK Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:25:37] 7Puppet, 6operations, 6Discovery, 3Discovery-Wikidata-Query-Service-Sprint, 5Patch-For-Review: Make a puppet role that sets up a query service and loads it - https://phabricator.wikimedia.org/T95679#1440193 (10Joe) [07:34:44] RECOVERY - puppet last run on ganeti2002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:48:55] !log oblivian Synchronized php-1.26wmf13/thumb.php: Re-add fix for thumb.php 404s on HHVM (duration: 00m 13s) [07:49:00] Logged the message, Master [07:51:43] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 15.38% of data above the critical threshold [500.0] [07:52:18] <_joe_> uhm something went wrong? [07:53:25] <_joe_> ok, no [07:53:26] <_joe_> :) [07:53:31] <_joe_> just a hiccup [07:55:04] (03PS1) 10Giuseppe Lavagetto: pybal::web: only generate existing pools [puppet] - 10https://gerrit.wikimedia.org/r/223740 [07:55:34] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] pybal::web: only generate existing pools [puppet] - 10https://gerrit.wikimedia.org/r/223740 (owner: 10Giuseppe Lavagetto) [08:05:13] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [08:08:58] (03PS1) 10KartikMistry: Update Campaigns config as per 223387 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223742 [08:09:54] (03CR) 10Muehlenhoff: [C: 04-1] "There's a few issues in the check_conntrack.py script:" [puppet] - 10https://gerrit.wikimedia.org/r/223560 (owner: 10Matanya) [08:10:34] (03CR) 10Nikerabbit: [C: 04-1] Update Campaigns config as per 223387 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223742 (owner: 10KartikMistry) [08:12:29] (03PS1) 10Giuseppe Lavagetto: pybal::web: add the pybal syntax checker [puppet] - 10https://gerrit.wikimedia.org/r/223745 [08:21:27] (03CR) 10Giuseppe Lavagetto: [C: 032] pybal::web: add the pybal syntax checker [puppet] - 10https://gerrit.wikimedia.org/r/223745 (owner: 10Giuseppe Lavagetto) [08:23:32] 6operations, 10Wikimedia-Mailing-lists, 7Mail: Spam solutions for Education-l mailing list - https://phabricator.wikimedia.org/T100428#1440378 (10Selsharbaty-WMF) Hi John, I will close it now. I have applied a new filter blocking messages with certain common spam words in the subject. It worked and no spam... 
[08:24:11] 6operations, 10Wikimedia-Mailing-lists, 7Mail: Spam solutions for Education-l mailing list - https://phabricator.wikimedia.org/T100428#1440381 (10Selsharbaty-WMF) 5Open>3Resolved [08:25:31] (03Abandoned) 10Matanya: add soundcloud to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223516 (https://phabricator.wikimedia.org/T105052) (owner: 10Matanya) [08:27:55] (03PS2) 10KartikMistry: Update Campaigns config as per 223387 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223742 [08:43:37] (03PS1) 10Giuseppe Lavagetto: pybal: cosmetic fixes to the confd template [puppet] - 10https://gerrit.wikimedia.org/r/223748 [08:50:36] (03CR) 10Giuseppe Lavagetto: [C: 032] pybal: cosmetic fixes to the confd template [puppet] - 10https://gerrit.wikimedia.org/r/223748 (owner: 10Giuseppe Lavagetto) [08:54:06] 6operations, 10Continuous-Integration-Infrastructure, 6Labs, 10Labs-Infrastructure: dnsmasq returns SERVFAIL for (some?) names that do not exist instead of NXDOMAIN - https://phabricator.wikimedia.org/T92351#1106873 (10hashar) We still have that the Gerrit patch applied on the `integration` labs project. F... [08:54:49] (03CR) 10Hashar: "That is still applied on the integration labs project. I have filled T105297 to remove it and reenable ndots:2" [puppet] - 10https://gerrit.wikimedia.org/r/196731 (https://phabricator.wikimedia.org/T92351) (owner: 10Dzahn) [08:57:13] 6operations, 10Continuous-Integration-Infrastructure, 6Labs, 10Labs-Infrastructure: dnsmasq returns SERVFAIL for (some?) names that do not exist instead of NXDOMAIN - https://phabricator.wikimedia.org/T92351#1440518 (10hashar) >>! In T92351#1439472, @scfc wrote: > Too bad for the readers Google will bring... [09:08:03] (03CR) 10Hashar: [C: 04-1] "It is not on Precise. So we need to prevent the puppet class from landing on Precise :-(" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/223691 (https://phabricator.wikimedia.org/T103039) (owner: 10Dduvall) [09:09:04] (03CR) 10Zfilipin: [C: 04-1] contint: Install chromedriver for running MW-Selenium tests [puppet] - 10https://gerrit.wikimedia.org/r/223691 (https://phabricator.wikimedia.org/T103039) (owner: 10Dduvall) [09:09:53] (03PS3) 10Matanya: monitoring: detect saturation of nf_conntrack table [puppet] - 10https://gerrit.wikimedia.org/r/223560 [09:10:36] (03CR) 10jenkins-bot: [V: 04-1] monitoring: detect saturation of nf_conntrack table [puppet] - 10https://gerrit.wikimedia.org/r/223560 (owner: 10Matanya) [09:11:08] (03PS5) 10Hashar: contint: PIL 1.1.7 expects libs in /usr/lib [puppet] - 10https://gerrit.wikimedia.org/r/216307 (https://phabricator.wikimedia.org/T101550) [09:11:23] (03CR) 10Hashar: [C: 031 V: 032] "Rebased" [puppet] - 10https://gerrit.wikimedia.org/r/216307 (https://phabricator.wikimedia.org/T101550) (owner: 10Hashar) [09:11:25] (03PS4) 10Matanya: monitoring: detect saturation of nf_conntrack table [puppet] - 10https://gerrit.wikimedia.org/r/223560 [09:11:51] (03PS7) 10Hashar: contint: Rename 'qunit' localvhost to 'worker' [puppet] - 10https://gerrit.wikimedia.org/r/220666 (https://phabricator.wikimedia.org/T103766) (owner: 10Krinkle) [09:11:59] (03CR) 10Hashar: "rebased" [puppet] - 10https://gerrit.wikimedia.org/r/220666 (https://phabricator.wikimedia.org/T103766) (owner: 10Krinkle) [09:12:31] good morning! 
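For context on the check_conntrack.py / "detect saturation of nf_conntrack table" change being iterated on above (Gerrit 223560): the kernel exposes the current and maximum conntrack entry counts under /proc/sys/net/netfilter/, and an Icinga/NRPE-style check compares their ratio against thresholds. The sketch below shows the general shape of such a check under those assumptions; it is not the script under review, and its thresholds and exit conventions are illustrative.

```python
#!/usr/bin/env python3
"""Nagios/Icinga-style nf_conntrack saturation check (illustrative sketch, not Gerrit 223560)."""
import sys

CONNTRACK = "/proc/sys/net/netfilter/nf_conntrack_count"
CONNTRACK_MAX = "/proc/sys/net/netfilter/nf_conntrack_max"

def read_int(path: str) -> int:
    with open(path) as f:
        return int(f.read().strip())

def main(warn_pct: float = 80.0, crit_pct: float = 90.0) -> int:
    count = read_int(CONNTRACK)
    maximum = read_int(CONNTRACK_MAX)
    used = 100.0 * count / maximum
    if used >= crit_pct:
        status, code = "CRITICAL", 2
    elif used >= warn_pct:
        status, code = "WARNING", 1
    else:
        status, code = "OK", 0
    # Nagios plugin convention: one line of output, exit code conveys the state.
    print(f"{status}: nf_conntrack table {used:.1f}% full ({count}/{maximum})")
    return code

if __name__ == "__main__":
    sys.exit(main())
```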
[09:13:12] I could please use merge for a couple contint puppet patches that only impact labs and already cherry picked there: https://gerrit.wikimedia.org/r/#/c/216307/ (some /lib/ symlink hacks) and https://gerrit.wikimedia.org/r/#/c/220666/ (apache config tweaks) [09:19:51] (03CR) 10Jcrespo: [C: 032] contint: PIL 1.1.7 expects libs in /usr/lib [puppet] - 10https://gerrit.wikimedia.org/r/216307 (https://phabricator.wikimedia.org/T101550) (owner: 10Hashar) [09:20:14] \O/ [09:20:25] that is really a ugly hack I am not proud of but at least it is in puppet now :-} [09:21:39] (03PS8) 10Jcrespo: contint: Rename 'qunit' localvhost to 'worker' [puppet] - 10https://gerrit.wikimedia.org/r/220666 (https://phabricator.wikimedia.org/T103766) (owner: 10Krinkle) [09:23:02] (03CR) 10Jcrespo: [C: 032] contint: Rename 'qunit' localvhost to 'worker' [puppet] - 10https://gerrit.wikimedia.org/r/220666 (https://phabricator.wikimedia.org/T103766) (owner: 10Krinkle) [09:32:31] and that concludes the review request I sent to ops list. Congrats [09:33:21] (03PS1) 10Muehlenhoff: Optionally disable connection tracking per service [puppet] - 10https://gerrit.wikimedia.org/r/223751 [09:45:43] (03CR) 10Mobrovac: [C: 031] "Strong +1 from me. Good find, Gabriel! Indeed, I remember when we said we should lower this once the initial import of wikis was done." [puppet] - 10https://gerrit.wikimedia.org/r/223722 (owner: 10GWicke) [09:56:03] PROBLEM - Host upload-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:57:48] what's up with upload-lb ? [09:59:07] no images at commons/enwiki [09:59:42] Christian75: yes, we know, we are investigating. thanks for reporting [10:01:23] RECOVERY - Host upload-lb.esams.wikimedia.org is UPING OK - Packet loss = 0%, RTA = 89.45 ms [10:10:54] (03CR) 10Mobrovac: [C: 04-1] "Agreed. We should try lowering the throughput first" [puppet] - 10https://gerrit.wikimedia.org/r/223457 (owner: 10GWicke) [10:12:41] still having issues with upload.wikimedia.org ? I'm only getting cached images right now. [10:13:03] PROBLEM - Host upload-lb.esams.wikimedia.org is DOWN: CRITICAL - Time to live exceeded (91.198.174.208) [10:13:15] ah yes. There we go. [10:13:34] <_joe_> akosiaris_: found anything? [10:13:59] (03CR) 10Mobrovac: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/219811 (https://phabricator.wikimedia.org/T98851) (owner: 10Muehlenhoff) [10:14:19] _joe_: lvs3002 seems to be the culprit but not sure why yes [10:14:50] <_joe_> I see the idleconnection messages coming through in batches [10:14:55] (03CR) 10Mobrovac: [C: 031] Enable firejail for graphoid [puppet] - 10https://gerrit.wikimedia.org/r/219801 (https://phabricator.wikimedia.org/T103095) (owner: 10Muehlenhoff) [10:15:32] UK based- slow response on images.... [10:15:39] You workiong on something? [10:15:57] (03CR) 10Mobrovac: "Needs rebase, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/219331 (https://phabricator.wikimedia.org/T101870) (owner: 10Muehlenhoff) [10:16:09] <_joe_> akosiaris_: I see no inactiveconnections in ipvsadm for upload [10:16:22] <_joe_> but a ton of active ones [10:17:15] (03CR) 10Muehlenhoff: [C: 04-1] "Looks much better. 
Two further remarks:" [puppet] - 10https://gerrit.wikimedia.org/r/223560 (owner: 10Matanya) [10:17:31] _joe_: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=vl100-eth0.lvs3002.esams.wmnet&m=bytes_in&s=by+name&mc=2&g=network_report&c=LVS+loadbalancers+esams [10:17:55] <_joe_> akosiaris_: uhm that doesn't look right [10:18:11] <_joe_> akosiaris_: let's try to swithc to the backup for now? [10:18:15] yes [10:18:18] <_joe_> i.e. restart pybal [10:18:21] <_joe_> I'll do it [10:18:32] done [10:18:33] not restart [10:18:34] stop [10:18:37] <_joe_> ok [10:18:54] (03CR) 10Mobrovac: "time to abandon this?" [puppet] - 10https://gerrit.wikimedia.org/r/196335 (https://phabricator.wikimedia.org/T92560) (owner: 10Eevans) [10:19:32] <_joe_> akosiaris_: no connections incoming for upload, wtf? [10:20:53] PROBLEM - Cassandra database on restbase1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [10:21:23] PROBLEM - Cassanda CQL query interface on restbase1001 is CRITICAL: Connection refused [10:21:26] <_joe_> akosiaris_: can you please join #security? [10:21:34] _joe_: my bouncer is down [10:21:36] invite me [10:21:46] <_joe_> oh right [10:21:54] of all the days... [10:22:14] <_joe_> invited [10:22:40] 6operations: upload.wikimedia.org down - https://phabricator.wikimedia.org/T105304#1440640 (10Steinsplitter) 3NEW [10:22:44] * mobrovac on cassandra rb1001 [10:24:05] 6operations: upload.wikimedia.org down - https://phabricator.wikimedia.org/T105304#1440648 (10Matanya) upload-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% and http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=vl100-eth0.lvs3002.esams.wmnet&m=bytes_in&s=by+name&mc=2&g=network... [10:24:52] RECOVERY - Cassandra database on restbase1001 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [10:24:53] RECOVERY - Disk space on labnodepool1001 is OK: DISK OK [10:25:14] RECOVERY - Cassanda CQL query interface on restbase1001 is OK: TCP OK - 0.009 second response time on port 9042 [10:25:18] (03CR) 10Joal: [C: 031] Add flag --all-projects to projectviews aggregator [puppet] - 10https://gerrit.wikimedia.org/r/223573 (https://phabricator.wikimedia.org/T95339) (owner: 10Mforns) [10:30:21] (03PS5) 10Matanya: monitoring: detect saturation of nf_conntrack table [puppet] - 10https://gerrit.wikimedia.org/r/223560 [10:30:42] PROBLEM - Disk space on labnodepool1001 is CRITICAL: DISK CRITICAL - /home/hashar/mount is not accessible: Permission denied [10:31:28] Hm.. getting lots of time outs from upload.wikimedia.org. 
Many articles with broken images [10:31:42] ah yeah [10:31:43] :) [10:33:42] 6operations, 7user-notice: upload.wikimedia.org down - https://phabricator.wikimedia.org/T105304#1440669 (10Matanya) [10:35:01] <_joe_> Krinkle: we know, sadly [10:35:19] 6operations: Tweak sysctl settings for nf_conntrack - https://phabricator.wikimedia.org/T105307#1440671 (10MoritzMuehlenhoff) 3NEW a:3MoritzMuehlenhoff [10:37:58] !log Shutdown AMS-IX route server BGP sessions on cr1-esams [10:38:03] Logged the message, Master [10:38:23] RECOVERY - Host upload-lb.esams.wikimedia.org is UPING OK - Packet loss = 0%, RTA = 88.80 ms [10:38:26] PROBLEM - Cassandra database on restbase1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [10:38:53] PROBLEM - Cassanda CQL query interface on restbase1001 is CRITICAL: Connection refused [10:39:05] hashar: wrt your mail to ops@, you wrote "All the patches/packages have been acted upon in a breeze": but AFAICS https://phabricator.wikimedia.org/T102106 isn't resolved yet, or is it missing a status update? [10:40:42] <_joe_> Krinkle: should be ok now? [10:41:12] confirmed here _joe_ [10:41:52] moritzm: yeah jenkins-debian-glue probably still needs to be uploaded for Jessie [10:42:07] moritzm: but since you told me you would eventually deal with it. I considered it fixed for the purpose of my mail :-} [10:42:35] _joe_: yup, ok [10:42:39] moritzm: it is not urgent, just need someone to eventually deal with it :} [10:42:40] bbl gotta lunch [10:44:33] RECOVERY - Cassandra database on restbase1001 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [10:44:53] RECOVERY - Cassanda CQL query interface on restbase1001 is OK: TCP OK - 0.001 second response time on port 9042 [10:45:49] hasharLunch: ok, I see :-) [10:48:08] (03PS1) 10BBlack: depool only upload-lb in esams [dns] - 10https://gerrit.wikimedia.org/r/223757 [10:48:53] (03CR) 10BBlack: [C: 032 V: 032] depool only upload-lb in esams [dns] - 10https://gerrit.wikimedia.org/r/223757 (owner: 10BBlack) [10:49:14] (03PS1) 10Matanya: sysctl: tweak sysctl settings for nf_conntrack [puppet] - 10https://gerrit.wikimedia.org/r/223758 [10:53:07] !log restbase started thinner 15 days for wikimedia group [10:53:11] Logged the message, Master [10:54:08] _joe_: want to comment/close/change status of https://phabricator.wikimedia.org/T105304 ? [10:54:42] <_joe_> matanya: 1 sec [10:55:15] thanks [10:57:11] (03CR) 10Alexandros Kosiaris: [C: 04-2] "I 'd rather we did not do that. The approach in https://gerrit.wikimedia.org/r/223751 is preferable (a per service disabling of NOTRACK). " [puppet] - 10https://gerrit.wikimedia.org/r/223758 (owner: 10Matanya) [10:58:25] 6operations, 7user-notice: upload.wikimedia.org down - https://phabricator.wikimedia.org/T105304#1440712 (10Joe) We think we solved the issue, but we temporarily set the esams DC offline for upload while we investigate the problem further. 
The public-facing issue should be solved since 10:38 UTC [10:58:40] 6operations, 7user-notice: upload.wikimedia.org down - https://phabricator.wikimedia.org/T105304#1440715 (10Joe) 5Open>3Resolved [11:04:02] PROBLEM - Cassandra database on restbase1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [11:04:22] PROBLEM - Cassanda CQL query interface on restbase1001 is CRITICAL: Connection refused [11:07:08] looking at restbase1001 [11:07:28] godog: we need to take a closer look there [11:07:39] it's died 2 times in the last hour [11:08:13] indeed, I'm looking at the logs and grafana [11:09:39] !log restbase deploying https://gerrit.wikimedia.org/r/#/c/223297/ which bumps the back-end module version ( https://github.com/wikimedia/restbase-mod-table-cassandra/pull/117 ) [11:09:44] Logged the message, Master [11:18:51] Anyone deploying? [11:19:15] Wanna get a fix out for redirect loop for pages with ~ in their title https://gerrit.wikimedia.org/r/#/c/223756/ [11:21:24] !log krinkle Synchronized php-1.26wmf13/includes/GlobalFunctions.php: T105265 (duration: 00m 12s) [11:21:29] Logged the message, Master [11:21:38] mobrovac: puppet is enabled across the board on restbase, looking at the disk io spike [11:22:03] !log krinkle Synchronized php-1.26wmf13/resources/src/mediawiki/mediawiki.util.js: T105265 (duration: 00m 11s) [11:22:07] Logged the message, Master [11:24:41] godog: no visible diff in the config on rb1001 [11:30:32] godog: we'd need to start cass on rb1001 soon, it's been out for 30 mins now [11:32:35] (03Abandoned) 10Matanya: sysctl: tweak sysctl settings for nf_conntrack [puppet] - 10https://gerrit.wikimedia.org/r/223758 (owner: 10Matanya) [11:34:06] mobrovac: yeah I'll restart it now, I think cassandra tried many writes to the ssd and they freaked out [11:34:15] well, not freaked out but didn't like it [11:34:34] !log restart cassandra on restbase1001 [11:34:38] Logged the message, Master [11:34:54] RECOVERY - Cassandra database on restbase1001 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [11:34:58] hm, could be a side effect of raising the number of concurrent writes [11:35:14] godog: but why is only rb1001 freaking out? [11:35:24] godog: maybe do a disk check there? 
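On the "maybe do a disk check there?" suggestion just above: one quick way to confirm heavy write activity locally, independent of Ganglia or Graphite, is to sample /proc/diskstats twice and diff the write counters. A minimal sketch of that idea follows; the device-name filter is an assumption and partitions are counted alongside whole disks, which is good enough for a quick look.

```python
#!/usr/bin/env python3
"""Rough per-device write-rate check by sampling /proc/diskstats (illustrative sketch)."""
import time

INTERVAL = 5  # seconds between samples

def sample():
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            name = fields[2]
            if name.startswith(("sd", "md", "nvme")):  # assumption: skip loop/ram devices
                # field 7 = writes completed, field 9 = sectors (512 B) written
                stats[name] = (int(fields[7]), int(fields[9]))
    return stats

before = sample()
time.sleep(INTERVAL)
after = sample()
for dev, (w1, s1) in sorted(after.items()):
    w0, s0 = before.get(dev, (w1, s1))
    mb_per_s = (s1 - s0) * 512 / INTERVAL / 1024 / 1024
    print(f"{dev}: {(w1 - w0) / INTERVAL:.0f} writes/s, {mb_per_s:.1f} MB/s written")
```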
[11:37:12] mobrovac: you can see it here, lots of writes and lots of written bytes https://graphite.wikimedia.org/render/?width=964&height=422&_salt=1436441810.934&from=-3hours&target=sumSeries(servers.restbase1001.iostat.sd*.writes)&target=secondYAxis(sumSeries(servers.restbase1001.iostat.sd*.writes_byte)) [11:37:12] RECOVERY - Cassanda CQL query interface on restbase1001 is OK: TCP OK - 0.004 second response time on port 9042 [11:38:16] yup [11:41:04] (03CR) 10Santhosh: [C: 031] Update Campaigns config as per 223387 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223742 (owner: 10KartikMistry) [11:46:44] mobrovac: I'm heading to lunch, will revisit later the latest code reviews but it looks like too much data was written at the same time [11:47:46] kk godog, buon appetito [11:47:48] (03PS3) 10Muehlenhoff: Enable firejail for mathoid [puppet] - 10https://gerrit.wikimedia.org/r/219331 (https://phabricator.wikimedia.org/T101870) [11:53:33] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused [11:53:53] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [11:54:39] on it ^^ [11:57:43] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [11:59:23] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.015 second response time on port 9042 [12:07:33] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [12:09:13] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused [12:10:09] mobrovac: this one too ^ ? [12:10:27] *sigh* [12:10:30] same node died again [12:10:38] thnx matanya [12:10:40] * mobrovac on it [12:11:15] Sorry I had to make the "u" uppercase :-) [12:13:23] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [12:14:53] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.002 second response time on port 9042 [12:17:41] jynus: hey! can you check the SQL bits on https://gerrit.wikimedia.org/r/#/c/223564/? [12:17:51] the LDAP bits need a little more completion but hopefully not too much more. [12:19:42] RECOVERY - Disk space on labnodepool1001 is OK: DISK OK [12:22:52] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused [12:23:13] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [12:26:12] oh come on [12:28:13] (03CR) 10Mobrovac: "First-round comments" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) (owner: 10Giuseppe Lavagetto) [12:29:02] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [12:29:20] euh? [12:29:27] who restarted cassandra on rb1004? 
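The Graphite render URL pasted at the start of this exchange can also be fetched with format=json for ad-hoc comparisons from a terminal. A small sketch using the standard render API; the target expression is adapted from the URL above, and the host list is an assumption for illustration.

```python
#!/usr/bin/env python3
"""Fetch the summed iostat write series from Graphite as JSON (illustrative sketch)."""
import json
import urllib.parse
import urllib.request

GRAPHITE = "https://graphite.wikimedia.org/render/"

def fetch_writes(host: str, window: str = "-3hours"):
    target = f"sumSeries(servers.{host}.iostat.sd*.writes)"
    qs = urllib.parse.urlencode({"target": target, "from": window, "format": "json"})
    with urllib.request.urlopen(f"{GRAPHITE}?{qs}") as resp:
        series = json.load(resp)
    # The render API returns [{"target": ..., "datapoints": [[value, timestamp], ...]}].
    return [(ts, val) for val, ts in series[0]["datapoints"] if val is not None]

if __name__ == "__main__":
    for host in ("restbase1001", "restbase1004"):  # hosts discussed above
        points = fetch_writes(host)
        peak = max(val for _, val in points) if points else 0
        print(f"{host}: {len(points)} datapoints, peak value {peak:.0f}")
```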
[12:30:33] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.001 second response time on port 9042 [12:33:52] <_joe_> mobrovac: heh, It's still not ready for review, sorry [12:38:12] 6operations, 7user-notice: upload.wikimedia.org down - https://phabricator.wikimedia.org/T105304#1441023 (10Aklapper) [12:38:27] _joe_: np, just wanted to point out some things (i see it's a WIP) [12:50:43] PROBLEM - Disk space on labnodepool1001 is CRITICAL: DISK CRITICAL - /tmp/image.nJC7L7DK/mnt is not accessible: Permission denied [12:52:43] RECOVERY - Disk space on labnodepool1001 is OK: DISK OK [12:53:13] PROBLEM - Cassanda CQL query interface on restbase1003 is CRITICAL: Connection refused [12:53:33] PROBLEM - Cassandra database on restbase1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [12:53:58] * mobrovac is thinking about writing a bot for restarting cassandra :P [12:54:59] (03CR) 10Muehlenhoff: [C: 04-1] "It's getting there, but some additional remarks:" [puppet] - 10https://gerrit.wikimedia.org/r/223560 (owner: 10Matanya) [12:57:32] RECOVERY - Cassandra database on restbase1003 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [12:59:03] RECOVERY - Cassanda CQL query interface on restbase1003 is OK: TCP OK - 0.002 second response time on port 9042 [13:02:25] !log upgrade python-django on graphite1001 and graphite2001 following http://www.ubuntu.com/usn/usn-2671-1/ [13:02:29] Logged the message, Master [13:10:40] YuviPanda, I have made a small comment to gerrit:223564 regarding searching existent users [13:11:48] from my side I am ok with it (right grants) [13:12:13] (03PS6) 10Matanya: monitoring: detect saturation of nf_conntrack table [puppet] - 10https://gerrit.wikimedia.org/r/223560 [13:16:27] (03CR) 10Jcrespo: [C: 031] "The grants are syntactically correct and correct from the MySQL point of view (I have manually checked), you have my +1, but Coren should " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/223564 (owner: 10Yuvipanda) [13:19:18] (03PS1) 10Faidon Liambotis: Revert "depool only upload-lb in esams" [dns] - 10https://gerrit.wikimedia.org/r/223772 [13:20:20] (03CR) 10BBlack: [C: 031] Revert "depool only upload-lb in esams" [dns] - 10https://gerrit.wikimedia.org/r/223772 (owner: 10Faidon Liambotis) [13:20:31] (03CR) 10Faidon Liambotis: [C: 032] Revert "depool only upload-lb in esams" [dns] - 10https://gerrit.wikimedia.org/r/223772 (owner: 10Faidon Liambotis) [13:21:22] godog: i think we should lower concurrent_writes introduced by https://gerrit.wikimedia.org/r/223454 [13:21:41] godog: 128 is probably too high given we've got RAID1 there [13:21:45] (right?) 
[13:21:58] mobrovac: nope all raid0 [13:22:03] ah [13:22:16] godog: btw, rb1003 died with the same symptoms while you were away [13:22:49] sigh, yeah I agree we should try lowering it by say 25% [13:23:17] (03CR) 10Jcrespo: "Sorry I meant including and Host = %s" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/223564 (owner: 10Yuvipanda) [13:23:35] godog: yup, 96 is what i had in mind [13:23:38] * mobrovac preparing a patch [13:23:54] (03PS14) 10Alexandros Kosiaris: beta: varnish backend/director for isolated security audits [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://phabricator.wikimedia.org/T72181) (owner: 10Dduvall) [13:24:16] (03CR) 10Alexandros Kosiaris: "FYI, running a puppet catalog compiler test" [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://phabricator.wikimedia.org/T72181) (owner: 10Dduvall) [13:25:18] mobrovac: also urandom and gwicke will be online soon, I'm looking into cassandra options to see if we can throttle the write bandwidth [13:27:10] (03PS1) 10Mobrovac: Cassandra: Decrease concurrent_writes to 96 [puppet] - 10https://gerrit.wikimedia.org/r/223774 [13:27:18] godog: ^^ [13:27:29] 6operations, 7HTTPS, 7LDAP: update ldap-mirror.wikimedia.org certificate to sha256 - https://phabricator.wikimedia.org/T105187#1441171 (10akosiaris) As @RobH points out ``` openssl x509 -in /home/alex/wikimedia/gerrit/puppet/production/files/ssl/ldap-mirror.wikimedia.org.crt -issuer issuer= /C=US/ST=Califor... [13:30:42] mobrovac: yup, the 100MB/s figure seems a bit low? more like 500MB/s per disk [13:32:16] godog: http://grafana.wikimedia.org/#/dashboard/db/restbase-cassandra-system?panelId=8&fullscreen suggests 100, but I have a feeling something's not good in that graph [13:32:54] (03PS1) 10BBlack: update to openssl-1.0.2d [debs/openssl] (upstream) - 10https://gerrit.wikimedia.org/r/223775 [13:33:47] (03CR) 10BBlack: [C: 032 V: 032] "Validated against separate comparison of 1.0.2c and d tarballs, same diff." [debs/openssl] (upstream) - 10https://gerrit.wikimedia.org/r/223775 (owner: 10BBlack) [13:35:39] (03CR) 10Matthias Mullie: [C: 04-1] "new_flow_page.rb assumes Flow_test_talk namespace exists. Should probably change that one back to Talk when removing these." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/223453 (https://phabricator.wikimedia.org/T104279) (owner: 10Mattflaschen) [13:36:05] (03CR) 10Matthias Mullie: "new_flow_page.rb is one of the browser tests, btw :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223453 (https://phabricator.wikimedia.org/T104279) (owner: 10Mattflaschen) [13:36:48] mobrovac: I was misreading a graph I pasted earlier, that above looks right, double checking [13:38:32] (03PS3) 10Jcrespo: Populate labsdb1004 with mariadb [puppet] - 10https://gerrit.wikimedia.org/r/218874 (https://phabricator.wikimedia.org/T88718) [13:38:44] (03PS1) 10BBlack: Merge branch 'upstream' [debs/openssl] - 10https://gerrit.wikimedia.org/r/223776 [13:39:01] (03CR) 10BBlack: [C: 032 V: 032] Merge branch 'upstream' [debs/openssl] - 10https://gerrit.wikimedia.org/r/223776 (owner: 10BBlack) [13:39:57] (03PS14) 10Hashar: nodepool: preliminary role and config file [puppet] - 10https://gerrit.wikimedia.org/r/201728 (https://phabricator.wikimedia.org/T89143) [13:41:01] (03CR) 10Jcrespo: [C: 031] "I do not know if to leave the puppet file with 2 roles for clarity or put a set of ifs, -1 or +1 what you prefer and I will do as commande" [puppet] - 10https://gerrit.wikimedia.org/r/218874 (https://phabricator.wikimedia.org/T88718) (owner: 10Jcrespo) [13:41:29] (03CR) 10Hashar: "The Jenkins ssh private key is no provided via a puppet source: puppet:///private/nodepool/dib_jenkins_id_rsa" [puppet] - 10https://gerrit.wikimedia.org/r/201728 (https://phabricator.wikimedia.org/T89143) (owner: 10Hashar) [13:42:55] regarding labsdb1004, one thing that could also be done is a delayed replication, which would allow a 1 day buffer for operator errors. [13:43:15] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Cassandra: Decrease concurrent_writes to 96 [puppet] - 10https://gerrit.wikimedia.org/r/223774 (owner: 10Mobrovac) [13:43:45] godog: run puppet on the nodes, i'll roll-restart afterwards [13:44:52] mobrovac: yeah it is running [13:44:58] cool [13:45:20] YuviPanda: can you re-link me to the etherpad/wiki/whatever for doing a manual backup of labstore? [13:46:39] andrewbogott: sure, moment [13:46:50] jynus: thanks for the cr! [13:46:53] I'll amend [13:47:11] for me it looks great! [13:47:25] thank you for taking the time [13:48:33] !log restbase cassandra rolling restart to apply https://gerrit.wikimedia.org/r/223774 [13:48:34] jynus: yw! [13:48:36] andrewbogott: http://etherpad.wikimedia.org/p/lvm-labstore-backups [13:48:38] Logged the message, Master [13:48:45] YuviPanda: thanks! [13:48:51] jynus: re: labstore1004/5, I don't think delayed replication would be of much use... [13:49:07] BTW, YuviPanda I fixed existing grants transparently [13:49:14] w00t, I saw. [13:49:16] thanks jynus :) [13:49:22] but I have still to do a manual review [13:49:31] is 10.1 on the horizon? [13:49:54] YuviPanda, to be fair not really, too unstable yet [13:50:03] right :( [13:50:15] the problem is that 10 already has roles [13:50:28] but it requires user action, which we cannot afford [13:50:50] right, yeah [13:50:53] unfortunatley. [13:50:53] maybe backport the future, but it is not easy [13:50:57] *feature [13:50:58] right. [13:51:10] definitelly not in short-time actions [13:51:13] is 10.1 even scheduled to be out at some point in the future? [13:51:15] X months? [13:51:18] or is it years? 
[13:51:25] not even scheduled to be released [13:51:32] heh right [13:51:32] imagine to become stable :-) [13:51:56] 5.7 will be in months [13:52:05] (03PS15) 10Hashar: nodepool: preliminary role and config file [puppet] - 10https://gerrit.wikimedia.org/r/201728 (https://phabricator.wikimedia.org/T89143) [13:52:12] but his "role implementation" also requires user action [13:52:26] mobrovac: done, go for it [13:52:28] (03CR) 10Hashar: "Resync nodepool.yaml template with latest changes." [puppet] - 10https://gerrit.wikimedia.org/r/201728 (https://phabricator.wikimedia.org/T89143) (owner: 10Hashar) [13:52:32] I agree that is painful [13:52:40] godog: already in the process of restart :P [13:52:48] (03PS16) 10Hashar: nodepool: preliminary role and config file [puppet] - 10https://gerrit.wikimedia.org/r/201728 (https://phabricator.wikimedia.org/T89143) [13:52:56] figured it would take only a couple of mins to apply that [13:53:00] specially when I had a nice "replicas role" where most of your problems will go away [13:53:43] (03PS2) 10Yuvipanda: labstore: Escape grants properly [puppet] - 10https://gerrit.wikimedia.org/r/222265 (https://phabricator.wikimedia.org/T101758) [13:53:50] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Escape grants properly [puppet] - 10https://gerrit.wikimedia.org/r/222265 (https://phabricator.wikimedia.org/T101758) (owner: 10Yuvipanda) [13:54:09] jynus: heh, yeah :( [13:54:10] oh well [13:54:22] my philosophy [13:54:42] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:54:53] lets get a patch (as I have also been doing on sanitarium) as soon a possible [13:55:06] then we can start about long term things [13:56:46] (03PS1) 10Hashar: nodepool: element with basic networking packages [puppet] - 10https://gerrit.wikimedia.org/r/223777 (https://phabricator.wikimedia.org/T105152) [13:57:49] (03PS8) 10Andrew Bogott: Add a labsproject fact that doesn't rely on ldap config. [puppet] - 10https://gerrit.wikimedia.org/r/220991 (https://phabricator.wikimedia.org/T93684) [13:57:51] (03PS5) 10Andrew Bogott: Use the labsproject fact rather than $::instanceproject from ldap [puppet] - 10https://gerrit.wikimedia.org/r/221562 [13:59:37] yup [14:02:25] (03PS7) 10Matanya: monitoring: detect saturation of nf_conntrack table [puppet] - 10https://gerrit.wikimedia.org/r/223560 [14:08:23] RECOVERY - DPKG on labmon1001 is OK: All packages OK [14:09:51] (03PS8) 10Matanya: monitoring: detect saturation of nf_conntrack table [puppet] - 10https://gerrit.wikimedia.org/r/223560 [14:14:24] (03PS1) 10Yuvipanda: labstore: Escape _s properly [puppet] - 10https://gerrit.wikimedia.org/r/223780 [14:14:26] (03PS1) 10Yuvipanda: labstore: Remove wildcard grants [puppet] - 10https://gerrit.wikimedia.org/r/223781 [14:14:34] (03CR) 10jenkins-bot: [V: 04-1] labstore: Remove wildcard grants [puppet] - 10https://gerrit.wikimedia.org/r/223781 (owner: 10Yuvipanda) [14:15:23] (03PS2) 10Yuvipanda: labstore: Remove wildcard grants [puppet] - 10https://gerrit.wikimedia.org/r/223781 [14:15:28] (03CR) 10jenkins-bot: [V: 04-1] labstore: Remove wildcard grants [puppet] - 10https://gerrit.wikimedia.org/r/223781 (owner: 10Yuvipanda) [14:16:27] (03PS3) 10Yuvipanda: labstore: Remove wildcard grants [puppet] - 10https://gerrit.wikimedia.org/r/223781 [14:17:09] Coren: ^ removes the wildcard grant for all tables... [14:17:37] Coren: if I merge this can you babysit a run? I don't understand the script well enough and its local dependencies to be ok running it myself. 
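On the "Escape grants properly" / "Escape _s properly" labstore changes just above: in MySQL/MariaDB GRANT statements the database part is a pattern, so a literal '_' or '%' must be backslash-escaped or the grant silently covers more databases than intended (which is also why the wildcard grants are being removed). A minimal sketch of the escaping idea; the helper, user and database names are hypothetical and this is not the actual labstore grant-management script.

```python
# Illustrative sketch only -- not the actual labstore grant-management script.

def escape_grant_db(dbname: str) -> str:
    """Backslash-escape MySQL pattern metacharacters in a GRANT database name."""
    return dbname.replace("_", r"\_").replace("%", r"\%")

def grant_statement(user: str, dbname: str, host: str = "%") -> str:
    # Hypothetical helper: full rights on one literal database for one account.
    return (
        f"GRANT ALL PRIVILEGES ON `{escape_grant_db(dbname)}`.* "
        f"TO '{user}'@'{host}';"
    )

# Unescaped, `s12345__mydb`.* is a pattern: each '_' matches any single
# character, so the grant could also cover other, unrelated databases.
# Escaped, it applies to exactly that one database.
print(grant_statement("s12345", "s12345__mydb"))
```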
[14:19:50] anyone know where that bug was for urls and text merging together? [14:19:55] on phab [14:20:24] 6operations, 6Services: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#1441330 (10Joe) ocg1003 is offline since months, and we really ought to re-commission it. I see the server was never reinstalled, in fact it's just grabbing jobs from the jobqueue and not getting any traffic from the loa... [14:20:52] 6operations, 6Services: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#1441332 (10Joe) 5stalled>3Open p:5Normal>3Unbreak! a:3Dzahn [14:24:37] come on db1003, you can do it! [14:24:43] !log really upgrade python-django on graphite2001 [14:24:48] Logged the message, Master [14:25:03] RECOVERY - DPKG on db1003 is OK: All packages OK [14:25:20] it helps to yell [14:27:15] (03PS1) 10BBlack: Release 1.0.2d-1~wmf1 [debs/openssl] - 10https://gerrit.wikimedia.org/r/223786 [14:28:22] !log installed python-django security updates on labmon, netmon and californium [14:28:26] Logged the message, Master [14:29:54] <_joe_> !log reimaging mw1152 for wiping any leftover local hacks. Depooling, scheduling downtime [14:29:58] Logged the message, Master [14:30:09] (03PS1) 10Alexandros Kosiaris: Sign ldap-mirror.wikimedia.org with SHA256 [puppet] - 10https://gerrit.wikimedia.org/r/223788 (https://phabricator.wikimedia.org/T105187) [14:32:00] (03CR) 10Yuvipanda: [C: 032] labstore: Remove wildcard grants [puppet] - 10https://gerrit.wikimedia.org/r/223781 (owner: 10Yuvipanda) [14:32:08] (03PS4) 10Aaron Schulz: Set $wgMainStash to redis instead of the DB default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221885 (https://phabricator.wikimedia.org/T88493) [14:33:14] Krenair: have a few minutes to help me understand https://phabricator.wikimedia.org/T102993? [14:34:08] andrewbogott, hey [14:34:25] andrewbogott, what failures did it cause months ago? [14:34:40] I don’t remember :( [14:34:58] Right now I’m looking at the mw config files and don’t see a switch for nutcracker. So it must be configured elsewhere? [14:36:19] andrewbogott, in wmf-config/mc.php I think? [14:36:39] that's where it points to nutcracker [14:37:03] hm… so looks like it isn’t switchable at all, huh? [14:37:13] Well, that also depends on HHVM_VERSION and silver doesn’t use hhvm [14:37:27] 6operations, 7HTTPS, 7LDAP, 5Patch-For-Review: update ldap-mirror.wikimedia.org certificate to sha256 - https://phabricator.wikimedia.org/T105187#1441382 (10akosiaris) Documentation in: https://wikitech.wikimedia.org/wiki/WMF_CA Merging puppet change and restarting ldap servers to pick up the change [14:37:33] yeah, I wonder if the localhost port is actually nutcracker [14:37:40] (03PS2) 10Alexandros Kosiaris: Sign ldap-mirror.wikimedia.org with SHA256 [puppet] - 10https://gerrit.wikimedia.org/r/223788 (https://phabricator.wikimedia.org/T105187) [14:37:47] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Sign ldap-mirror.wikimedia.org with SHA256 [puppet] - 10https://gerrit.wikimedia.org/r/223788 (https://phabricator.wikimedia.org/T105187) (owner: 10Alexandros Kosiaris) [14:37:56] 127.0.0.1:11212 <- that looks like ‘fall through to memcached’ but that doesn’t seem to be happening. 
[14:38:01] Let me see what’s running on silver [14:38:30] yeah, nutcracker is on 127.0.0.1:11212 [14:39:01] and memc is on 11000 [14:39:31] so maybe I should leave the mw config alone and just remove nutcracker on silver and move memc to 11212 [14:40:19] andrewbogott: nutcracker is a proxy for memcached [14:40:20] https://github.com/twitter/twemproxy [14:40:43] not the best software btw, but we do use it in production [14:40:53] 6operations, 10Continuous-Integration-Infrastructure, 6Labs, 10Labs-Infrastructure: dnsmasq returns SERVFAIL for (some?) names that do not exist instead of NXDOMAIN - https://phabricator.wikimedia.org/T92351#1441395 (10scfc) I don't mind if ops wouldn't have investigated the issue. But doing so, claiming... [14:40:55] akosiaris: silver is running one nutcracker and one memcached. So we’ve been thinking that we should rip it out and just use memc directly. [14:41:04] Since nutcracker has been failing and causing problems. [14:41:05] andrewbogott, https://gerrit.wikimedia.org/r/#/c/158138/7/wmf-config/CommonSettings.php and then https://gerrit.wikimedia.org/r/#/c/158238/1/wmf-config/CommonSettings.php [14:41:10] andrewbogott: yeah, that probably makes more sense [14:41:31] it's not like nutcracker makes sense with a single backend memcached [14:41:55] (03CR) 10BBlack: [C: 032 V: 032] Release 1.0.2d-1~wmf1 [debs/openssl] - 10https://gerrit.wikimedia.org/r/223786 (owner: 10BBlack) [14:42:50] Krenair: ah, so it used to be configrable but is no longer. [14:43:40] you disabled our memcached setup for wikitech, and it got reverted shortly after [14:44:10] !log reprepro: jessie-wikimedia/backports openssl pkg, 1.0.2c-1 => 1.0.2d-1~wmf1 [14:44:14] !log re-enabled compaction throttling (60mb/s) on cassandra nodes [14:44:15] Logged the message, Master [14:44:18] Logged the message, Master [14:44:49] So reverting https://gerrit.wikimedia.org/r/#/c/158238 /should/ do what we want, except for we tried it and it didn’t [14:47:52] PROBLEM - Host mw1152 is DOWN: PING CRITICAL - Packet loss = 100% [14:48:22] Krenair: what does it mean to not set things like wgMainCacheType? [14:48:27] _joe_: once imagescalers are done, videoscalers will be next? and does that mean it will unblock vp9 which comes with newer ffmpeg ? [14:48:50] <_joe_> respectively yes, and we'll see [14:48:54] 6operations, 10Continuous-Integration-Infrastructure, 6Labs, 10Labs-Infrastructure: dnsmasq returns SERVFAIL for (some?) names that do not exist instead of NXDOMAIN - https://phabricator.wikimedia.org/T92351#1441425 (10coren) >>! In T92351#1441395, @scfc wrote: > But doing so, claiming that `dnsmasq` is at... [14:50:01] andrewbogott, I think that's the default object cache type used? [14:50:45] _joe_: btw, might be a good idea to rename tmh* to mw* when that happens [14:51:00] <_joe_> paravoid: it's my plan [14:51:04] awesome :) [14:51:13] <_joe_> I was given some controversy for that though [14:51:50] (03Abandoned) 10Faidon Liambotis: move cassandra submodule into puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/196335 (https://phabricator.wikimedia.org/T92560) (owner: 10Eevans) [14:52:04] Krenair: I see a file here named mc-labs.php which doesn’t seem to get pulled in anyplace. Do you object to me reviving that? (It has an obvious mistake which I will fix...) [14:52:17] andrewbogott, that's for deployment-prep [14:52:26] ah [14:52:43] * andrewbogott grumbles about beta vs. deployment-prep vs. 
labs [14:52:52] I believe it would be included by require( getRealmSpecificFilename( "$wmfConfigDir/mc.php" ) ); [14:52:58] as per the readme [14:54:22] RECOVERY - Host mw1152 is UPING OK - Packet loss = 0%, RTA = 1.35 ms [14:54:35] (03CR) 10Faidon Liambotis: [C: 031] Add parser function secret() to get secret data [puppet] - 10https://gerrit.wikimedia.org/r/223494 (owner: 10BBlack) [14:56:00] (back later) [14:56:05] (03PS1) 10Filippo Giunchedi: cassandra: don't cronspam cassandra-metrics-collector output [puppet] - 10https://gerrit.wikimedia.org/r/223795 (https://phabricator.wikimedia.org/T104208) [14:56:20] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: don't cronspam cassandra-metrics-collector output [puppet] - 10https://gerrit.wikimedia.org/r/223795 (https://phabricator.wikimedia.org/T104208) (owner: 10Filippo Giunchedi) [14:56:58] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1441449 (10akosiaris) [14:56:59] 6operations, 7HTTPS, 7LDAP, 5Patch-For-Review: update ldap-mirror.wikimedia.org certificate to sha256 - https://phabricator.wikimedia.org/T105187#1441447 (10akosiaris) 5Open>3Resolved Done [14:57:04] PROBLEM - Apache HTTP on mw1152 is CRITICAL: Connection refused [14:57:17] jouncebot: !next [14:57:29] (bad syntax?) [14:57:30] paravoid: thanks! :) [14:57:44] PROBLEM - HHVM rendering on mw1152 is CRITICAL: Connection refused [14:57:44] !next [14:58:23] PROBLEM - RAID on mw1152 is CRITICAL: Connection refused by host [14:58:35] PROBLEM - configured eth on mw1152 is CRITICAL: Connection refused by host [14:58:38] (03PS4) 10Filippo Giunchedi: Reduce the compaction throughput limit to 60mb/s [puppet] - 10https://gerrit.wikimedia.org/r/223722 (owner: 10GWicke) [14:58:43] PROBLEM - dhclient process on mw1152 is CRITICAL: Connection refused by host [14:58:44] PROBLEM - nutcracker port on mw1152 is CRITICAL: Connection refused by host [14:58:44] PROBLEM - puppet last run on mw1152 is CRITICAL: Connection refused by host [14:58:54] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Reduce the compaction throughput limit to 60mb/s [puppet] - 10https://gerrit.wikimedia.org/r/223722 (owner: 10GWicke) [14:58:54] PROBLEM - nutcracker process on mw1152 is CRITICAL: Connection refused by host [14:59:23] PROBLEM - Disk space on mw1152 is CRITICAL: Connection refused by host [14:59:32] PROBLEM - HHVM processes on mw1152 is CRITICAL: Connection refused by host [14:59:42] <_joe_> I scheduled downtime, wtf [15:00:02] PROBLEM - salt-minion processes on mw1152 is CRITICAL: Connection refused by host [15:00:04] manybubbles anomie ostriches marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150709T1500). [15:00:13] PROBLEM - DPKG on mw1152 is CRITICAL: Connection refused by host [15:00:16] _joe_: during SWAT? [15:00:28] who is SWATing? [15:00:40] <_joe_> kart__: what has reimaging a server to do with swat? [15:00:45] <_joe_> please elaborate [15:01:17] _joe_: excuse my ignorance :) [15:02:18] Krenair: are you SWATing? 
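To make the silver/wikitech exchange above easier to follow: the realm-specific cache file is pulled in via require( getRealmSpecificFilename( "$wmfConfigDir/mc.php" ) ), and on silver it currently points MediaWiki at a single-node nutcracker (twemproxy) proxy on 127.0.0.1:11212, with memcached itself listening on 11000. The sketch below is not the real wmf-config contents, just a minimal picture of the two shapes being discussed, using stock MediaWiki settings.

```
<?php
// Rough sketch only, not the actual wmf-config/mc.php.

// Today on silver: MediaWiki -> nutcracker (twemproxy) -> memcached
$wgMemCachedServers = [ '127.0.0.1:11212' ];    // nutcracker

// Option discussed above: drop nutcracker, since a proxy in front of a single
// memcached buys nothing here. Either point MediaWiki at memcached directly...
// $wgMemCachedServers = [ '127.0.0.1:11000' ];
// ...or leave the MediaWiki config alone and move memcached itself onto 11212.

// On the "what does it mean to not set things like wgMainCacheType?" question:
// if these are never set MediaWiki falls back to its built-in defaults; setting
// them explicitly is what keeps sessions, the parser cache, etc. in memcached.
$wgMainCacheType    = CACHE_MEMCACHED;
$wgSessionCacheType = CACHE_MEMCACHED;
$wgParserCacheType  = CACHE_MEMCACHED;
```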
[15:02:26] (03PS4) 10BBlack: Add parser function secret() to get secret data [puppet] - 10https://gerrit.wikimedia.org/r/223494 [15:02:36] (03PS1) 10Giuseppe Lavagetto: conftool: add ocg1003 to the pdf cluster [puppet] - 10https://gerrit.wikimedia.org/r/223797 [15:02:41] (03CR) 10BBlack: [C: 032 V: 032] Add parser function secret() to get secret data [puppet] - 10https://gerrit.wikimedia.org/r/223494 (owner: 10BBlack) [15:03:57] * kart__ wondering where SWATers are?? [15:05:07] jouncebot: next [15:05:07] In 1 hour(s) and 54 minute(s): Planet Aggregator SSL update (misc-web) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150709T1700) [15:05:26] ^ is the right syntax to ask [15:05:52] bblack: thanks. [15:06:07] bblack: is there anything like 'right now'? :) [15:06:17] (just to remind Where are thou?) [15:06:32] (03PS2) 10Giuseppe Lavagetto: conftool: add ocg1003 to the pdf cluster [puppet] - 10https://gerrit.wikimedia.org/r/223797 [15:06:50] there's one scheduled for 15:00: https://wikitech.wikimedia.org/wiki/Deployments#Thursday.2C.C2.A0July.C2.A009 [15:07:16] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: add ocg1003 to the pdf cluster [puppet] - 10https://gerrit.wikimedia.org/r/223797 (owner: 10Giuseppe Lavagetto) [15:08:01] (which jouncebot announced at 15:00 above, also) [15:08:12] bblack: yes. Waiting for someone from list who can SWAT for me. [15:08:25] ping them? [15:09:01] Krenair manybubbles anomie ostriches marktraceur: ^ [15:09:17] manybubbles: anomie ostriches marktraceur Krenair : SWAT! [15:09:20] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused [15:09:53] !log bounced cassandra on restbase1004 [15:09:57] Logged the message, Master [15:09:58] Ugh why did I answer [15:10:14] Hi, I'm back [15:10:18] OK so...I have a meeting in twenty minutes and I'm not currently set up to do SWAT, is there someone...aha [15:10:19] but it looks like marktraceur got here first [15:10:27] Krenair: No, no, please [15:11:11] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.005 second response time on port 9042 [15:11:21] (03PS3) 10Alex Monk: Update Campaigns config as per I2d77318b [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223742 (owner: 10KartikMistry) [15:11:32] Krenair: thanks [15:11:52] Krenair: second patch need to merge/deploy asap after above change [15:11:53] I'm not sure about this one kart__ [15:12:00] oh I see [15:12:07] you're backporting the change and fixing the config at the same time [15:12:50] yes [15:12:50] Don't we also need wmf13 kart__? [15:12:58] krenair@tin:~$ mwversionsinuse [15:12:58] 1.26wmf12 1.26wmf13 [15:13:08] Krenair: oh. we already branched. [15:13:11] yep. [15:13:32] Krenair: https://gerrit.wikimedia.org/r/#/c/223737/ [15:13:34] yeah, wmf13 goes to wikipedias later [15:13:42] Krenair: we just need to merge it then, right? [15:13:46] no scap. [15:13:59] it's already on some sites [15:14:07] so we do need to sync it [15:14:19] ah. yes. thanks! [15:14:31] Should I merge, or will you do that? [15:15:02] 6operations, 7Monitoring, 5Patch-For-Review: remove ganglia(old), replace with ganglia_new - https://phabricator.wikimedia.org/T93776#1441514 (10Dzahn) Thanks very much for helping @Gage. That fixed the logstash issues. --- So, i have: - added aggregator for ULSFO on bast4001 https://gerrit.wikimedia.or... [15:17:16] kart__, we want the config first, then the backports, right? 
[15:17:42] 6operations, 10ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1441523 (10Papaul) [15:17:43] 6operations: list xfp/sfp+ inventory @ codfw - https://phabricator.wikimedia.org/T105170#1441521 (10Papaul) 5Open>3Resolved i looked in all the boxes that I have on site and found only 2 QSFP's and 1 XFP. I update also the Google doc [15:17:47] Either is fine. Just need to have minimum gap. [15:18:12] (03CR) 10Alex Monk: [C: 032] Update Campaigns config as per I2d77318b [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223742 (owner: 10KartikMistry) [15:18:20] (03Merged) 10jenkins-bot: Update Campaigns config as per I2d77318b [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223742 (owner: 10KartikMistry) [15:18:39] PROBLEM - Apache HTTP on mw1160 is CRITICAL - Socket timeout after 10 seconds [15:20:30] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.096 second response time [15:21:41] Krenair: should I merge wmf12/wmf13 patches? [15:21:47] I did it kart__ [15:21:54] cool! [15:22:57] kart__, I think it's all ready, going to start now [15:23:13] Krenair: Sure! [15:23:31] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/223742/ (duration: 00m 12s) [15:23:35] Logged the message, Master [15:23:54] !log krenair Synchronized php-1.26wmf13/extensions/ContentTranslation: https://gerrit.wikimedia.org/r/#/c/223737/ (duration: 00m 12s) [15:23:59] Logged the message, Master [15:24:14] !log krenair Synchronized php-1.26wmf12/extensions/ContentTranslation: https://gerrit.wikimedia.org/r/#/c/223739/ (duration: 00m 12s) [15:24:18] Logged the message, Master [15:24:21] kart__, please test [15:24:41] _joe_, mw1152.eqiad.wmnet host key verification failed - is that you? [15:25:11] <_joe_> Krenair: yes, reimaging it [15:25:19] okay, so it's not actually serving traffic? [15:25:24] or... scaling stuff [15:25:56] PROBLEM - configured eth on mw1152 is CRITICAL: Connection refused by host [15:26:16] PROBLEM - dhclient process on mw1152 is CRITICAL: Connection refused by host [15:26:56] PROBLEM - nutcracker port on mw1152 is CRITICAL: Connection refused by host [15:26:57] (03PS2) 10Aklapper: Allow aklapper to reset user auths and delete accounts in Phab [puppet] - 10https://gerrit.wikimedia.org/r/219151 [15:27:06] Krenair: thanks. Testing! 
[15:27:16] PROBLEM - nutcracker process on mw1152 is CRITICAL: Connection refused by host [15:27:17] PROBLEM - DPKG on mw1152 is CRITICAL: Connection refused by host [15:27:17] PROBLEM - puppet last run on mw1152 is CRITICAL: Connection refused by host [15:27:26] PROBLEM - salt-minion processes on mw1152 is CRITICAL: Connection refused by host [15:27:26] PROBLEM - Disk space on mw1152 is CRITICAL: Connection refused by host [15:27:51] (03CR) 10Chad: [C: 032] Add a --all option to updateBranchPointers to update all branches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/213859 (owner: 1020after4) [15:27:55] PROBLEM - HHVM processes on mw1152 is CRITICAL: Connection refused by host [15:28:22] (03Merged) 10jenkins-bot: Add a --all option to updateBranchPointers to update all branches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/213859 (owner: 1020after4) [15:28:25] (03PS3) 10Aklapper: Allow aklapper to reset user auths and delete accounts in Phab [puppet] - 10https://gerrit.wikimedia.org/r/219151 [15:28:46] PROBLEM - RAID on mw1152 is CRITICAL: Connection refused by host [15:29:32] (03CR) 1020after4: [C: 031] Allow aklapper to reset user auths and delete accounts in Phab [puppet] - 10https://gerrit.wikimedia.org/r/219151 (owner: 10Aklapper) [15:30:49] Krenair: testing done. Thanks a lot!!! [15:30:50] (03PS1) 10Giuseppe Lavagetto: conftool: add mw1152 to the imagescaler cluster [puppet] - 10https://gerrit.wikimedia.org/r/223804 [15:31:04] looks like ostriches is trying to deploy something [15:31:07] I'm done now anyway [15:31:15] Not really, it doesn't need a sync [15:31:23] I merged it, realized it was mw-config [15:31:37] So just pulled to tin so icinga wouldn't complain. [15:31:46] 6operations, 6Discovery, 10Wikimedia-Logstash, 7Elasticsearch: Update Wikimedia apt repo to include debs for Elasticsearch & Logstash on jessie - https://phabricator.wikimedia.org/T98042#1441591 (10bd808) Elasticsearch 1.6.0 should be the default jessie deb now that it the default for precise/trusty. [15:31:48] yeah, luckily I didn't need to revert [15:32:13] 6operations, 6Discovery, 10Wikimedia-Logstash, 7Elasticsearch: Update Wikimedia apt repo to include debs for Elasticsearch & Logstash on jessie - https://phabricator.wikimedia.org/T98042#1441598 (10bd808) [15:32:15] 6operations, 10Wikimedia-Logstash: reinstall logstash1001-1003 - https://phabricator.wikimedia.org/T97545#1441597 (10bd808) [15:33:04] 6operations, 10Wikimedia-Logstash: reinstall logstash1001-1003 - https://phabricator.wikimedia.org/T97545#1245961 (10bd808) [15:33:15] (03PS2) 10Giuseppe Lavagetto: conftool: add mw1152 to the imagescaler cluster [puppet] - 10https://gerrit.wikimedia.org/r/223804 [15:33:30] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] conftool: add mw1152 to the imagescaler cluster [puppet] - 10https://gerrit.wikimedia.org/r/223804 (owner: 10Giuseppe Lavagetto) [15:33:34] 6operations, 10Wikimedia-Logstash: reinstall logstash1001-1003 - https://phabricator.wikimedia.org/T97545#1245961 (10bd808) [15:34:15] !log bounced cassandra on restbase1004 [15:35:15] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 1 failures [15:35:37] PROBLEM - puppet last run on cp3034 is CRITICAL puppet fail [15:38:34] so iridium was set up with a very small / volume and a very large /srv. Problem is, logs are on /var (root filesystem) and it nearly filled up yesterday. At the current rate, phd and apache2 logs will fill it again soonish. 
What should I do, make /var/log/phd and /var/log/apache2 into symlinks to /srv/log/phd and /srv/log/apache2? [15:39:29] the rate of logging has been increasing, I think due to all the git activity since we've imported a lot of repositories [15:39:48] no lvm eh? tut tut [15:40:15] RECOVERY - check_puppetrun on boron is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [15:40:34] I don't think so? it's got raid1 on two physical disks it seems [15:41:06] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0] [15:41:08] and three partitions on each, with one tiny partition unused [15:43:05] the root partition is only about 10 gigabytes. I moved 2 gigs of phd logs last night to free space [15:43:44] apache2 has a 52 week log rotation which isn't aggressive enough, and phd doesn't seem to automatically rotate it's logs [15:44:41] heh the latter is certainly an issue in itself [15:45:27] yes, especially since phd is responsible for the majority of the logging on that box [15:45:36] that 2gb log was all one file [15:48:46] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [15:48:51] twentyafterfour: mind filing bugs for the small partition and phd log rotation? it doesn't seem immediately on fire but something we should look at shortly [15:50:35] RECOVERY - Disk space on mw1152 is OK: DISK OK [15:50:35] RECOVERY - salt-minion processes on mw1152 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:50:56] RECOVERY - configured eth on mw1152 is OK - interfaces up [15:50:56] RECOVERY - HHVM processes on mw1152 is OK: PROCS OK: 6 processes with command name hhvm [15:51:16] RECOVERY - dhclient process on mw1152 is OK: PROCS OK: 0 processes with command name dhclient [15:51:54] YuviPanda: https://www.openssl.org/news/secadv_20150709.txt [15:51:56] RECOVERY - nutcracker port on mw1152 is OK: TCP OK - 0.000 second response time on port 11212 [15:51:56] RECOVERY - RAID on mw1152 is OK no RAID installed [15:52:16] RECOVERY - nutcracker process on mw1152 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker [15:52:25] RECOVERY - DPKG on mw1152 is OK: All packages OK [15:52:25] PROBLEM - puppet last run on mw1152 is CRITICAL Puppet has 6 failures [15:52:46] PROBLEM - Cassandra database on restbase1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [15:53:06] RECOVERY - puppet last run on cp3034 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:53:21] legoktm ^ [15:53:24] looking at restbase1002 [15:53:29] 6operations, 10Wikimedia-Logstash: reinstall logstash1001-1003 - https://phabricator.wikimedia.org/T97545#1441645 (10bd808) Once I complete {T105101} I think we will be ready to start rebuilding logstash100[1-3]. I think this is the list of things we want to do as part of this task: * [ ] {T98042} * [ ] Remov... 
[15:54:06] PROBLEM - Cassanda CQL query interface on restbase1002 is CRITICAL: Connection refused [15:54:16] RECOVERY - puppet last run on mw1152 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:54:55] Bsadowski1: the openssl versions we use are not affected [15:57:06] !log restart cassandra on restbase1002 [15:57:11] Logged the message, Master [15:57:15] Oh okay [15:57:18] :) [15:58:35] RECOVERY - Cassandra database on restbase1002 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [15:59:57] RECOVERY - Cassanda CQL query interface on restbase1002 is OK: TCP OK - 0.002 second response time on port 9042 [16:05:15] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 1 failures [16:06:59] <_joe_> !log repooling mw1152 [16:07:03] Logged the message, Master [16:07:55] 10Ops-Access-Requests, 6operations, 6Discovery, 10SEO, 3Discovery-Analysis-Sprint: Get Oliver Keyes access to Google Webmaster Tools for all Wikimedia domains - https://phabricator.wikimedia.org/T101157#1441701 (10Deskana) @chasemp I responded to your email a while back and said your plan sounded good. D... [16:10:15] RECOVERY - check_puppetrun on boron is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:13:36] (03PS1) 10RobH: removing rapidssl_ca sha1 intermediary from repo [puppet] - 10https://gerrit.wikimedia.org/r/223816 [16:15:34] (03CR) 10RobH: [C: 031] "I think this will work fine, but it isn't a rush. As such, I'd like either Brandon or Faidon to confirm this would work." [puppet] - 10https://gerrit.wikimedia.org/r/223816 (owner: 10RobH) [16:17:25] PROBLEM - puppet last run on ms-be1018 is CRITICAL Puppet has 1 failures [16:17:47] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1441750 (10RobH) I've confirmed with Jeff that he doesn't use any copies of the rapidssl SHA1 intermediary cert from our public repo (he has his own copy in frack repo). As such, I've submitted https://gerrit... [16:22:20] 6operations, 10ops-eqiad, 7Swift: ms-be2013 - swift-storage/sdc1 is not accessible: Input/output error - https://phabricator.wikimedia.org/T105213#1441778 (10fgiunchedi) [16:22:34] 6operations, 10ops-eqiad, 7Swift: ms-be2013 - swift-storage/sdc1 is not accessible: Input/output error - https://phabricator.wikimedia.org/T105213#1441780 (10fgiunchedi) a:3Cmjohnson [16:22:56] robh: on the rapidssl sha1 thing: do we have some validation that there's no certs laying around the puppet repo still signed by it? some forgotten legacy corner case that's still in use or whatever? [16:23:24] (or worse, unpuppetized ones on live systems, using apache SSLCAPath or whatever lookups through it) [16:24:08] I guess if we know the inventory of live/valid ones we've ever had from RapidSSL, via the account portal, that counts as a complete list of ones we'd care about. 
[16:25:25] well, we did the repo search and then everyone that was fixed via that task [16:25:40] so over time, each one was tested [16:26:04] but i didnt go back and audit each individual https entry clusterwide since ending the replacements via https presented service [16:26:10] just via the backend repo existence of certificates [16:26:23] (03CR) 10Tim Landscheidt: labstore: Escape grants properly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/222265 (https://phabricator.wikimedia.org/T101758) (owner: 10Yuvipanda) [16:26:36] there was no check for unpuppetized ones on systems nope [16:26:45] but man i hope those dont exist. [16:27:08] i figure its easier to find that post removal of file via simple salt searches though right? [16:27:25] 6operations, 6Services: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#1441808 (10Dzahn) I basically gave up on this task after trying and asked for help. So just re-assigning it to me may not be the most effective choice. [16:29:08] robh: I don't think we have to check for unpuppetized ones. if they still had validity (in cert lifetime terms), they'd still be listed on our rapidssl account and you'd have seen them there, I think [16:31:38] 6operations, 10ops-eqiad, 7Swift: ms-be2013 - swift-storage/sdc1 is not accessible: Input/output error - https://phabricator.wikimedia.org/T105213#1441838 (10Cmjohnson) Okay, did you remove the disk already? [16:32:02] ther eis no listing of rapidssl account [16:32:05] its one of the reasons we left [16:32:15] each one is its own independent issue via email portals [16:32:20] (no central mgmt) [16:32:55] RECOVERY - puppet last run on ms-be1018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:33:01] but i agree that i wasnt really worried about unpuppetized https in production, im fairly confident it doesnt exist. [16:33:28] (03CR) 10Tim Landscheidt: [C: 04-1] labstore: Escape _s properly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/223780 (owner: 10Yuvipanda) [16:35:15] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 1 failures [16:37:17] 6operations, 10ops-eqiad, 7Swift: ms-be2013 - swift-storage/sdc1 is not accessible: Input/output error - https://phabricator.wikimedia.org/T105213#1441863 (10fgiunchedi) a:5Cmjohnson>3Papaul nevermind Chris, I misread the hostname, this machine is several kms away from you :) moving to @papaul [16:37:56] (03PS4) 10Dzahn: mail: ferm rules for mailman [puppet] - 10https://gerrit.wikimedia.org/r/223279 (https://phabricator.wikimedia.org/T104980) (owner: 10John F. Lewis) [16:38:55] !log bounced cassandra on restbase1006 [16:38:57] RECOVERY - Disk space on ms-be2013 is OK: DISK OK [16:39:00] Logged the message, Master [16:39:16] 6operations, 10ops-eqiad, 7Swift: ms-be2013 - swift-storage/sdc1 is not accessible: Input/output error - https://phabricator.wikimedia.org/T105213#1441885 (10Cmjohnson) okay, so the disk at slot2 which would be /dev/sdc is missing altogether from megacli the others look good cmjohnson@ms-be2013:~$ sudo mega... [16:40:05] 6operations, 7Wikimedia-log-errors: mw1150 spams "memcached error for key" since May 29 3:00am UTC - https://phabricator.wikimedia.org/T100780#1441894 (10demon) p:5Triage>3Normal [16:40:07] (03CR) 10Dzahn: [C: 032] mail: ferm rules for mailman [puppet] - 10https://gerrit.wikimedia.org/r/223279 (https://phabricator.wikimedia.org/T104980) (owner: 10John F. 
Lewis) [16:40:11] 6operations, 10Wikimedia-General-or-Unknown, 5Patch-For-Review, 7Wikimedia-log-errors: eval.php on silver shows SERVER_NAME undefined index errors - https://phabricator.wikimedia.org/T98615#1441899 (10demon) p:5Triage>3Normal [16:40:15] RECOVERY - check_puppetrun on boron is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [16:41:27] (03CR) 10Merlijn van Deen: [C: 04-1] labstore: Escape _s properly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/223780 (owner: 10Yuvipanda) [16:42:56] (03CR) 10Merlijn van Deen: [C: 031] labstore: Use safe_load vs load for yaml loading [puppet] - 10https://gerrit.wikimedia.org/r/223074 (owner: 10Yuvipanda) [16:43:41] (03CR) 10Merlijn van Deen: [C: 031] labstore: Be less noisy in logging [puppet] - 10https://gerrit.wikimedia.org/r/223075 (owner: 10Yuvipanda) [16:45:45] 6operations, 10ops-eqiad, 7Swift: ms-be2013 - swift-storage/sdc1 is not accessible: Input/output error - https://phabricator.wikimedia.org/T105213#1441946 (10fgiunchedi) yep I've tried umount but of course it is stuck, I'll reboot the machine @papaul, I've located the disk on the controller so it should be... [16:47:21] 6operations, 10Labs-Vagrant: Backport Vagrant 1.7+ from Debian experimental to our Trusty apt repo - https://phabricator.wikimedia.org/T93153#1441953 (10yuvipanda) *bump* [16:48:20] !log reboot ms-be2013 T105213 [16:48:24] Logged the message, Master [16:51:56] 6operations, 6Discovery, 7Elasticsearch: logstash partman recipe huge root partition - https://phabricator.wikimedia.org/T104035#1441979 (10fgiunchedi) [16:51:57] 6operations, 10Wikimedia-Logstash: reinstall logstash1001-1003 - https://phabricator.wikimedia.org/T97545#1441978 (10fgiunchedi) [16:52:32] 6operations, 10Labs-Vagrant: Backport Vagrant 1.7+ from Debian experimental to our Trusty apt repo - https://phabricator.wikimedia.org/T93153#1441981 (10dduvall) I submitted a patch to fix system gem integration a while. back. I'm not sure if it was applied to the latest package but it would be great to check... [16:52:38] 6operations, 10Wikimedia-Logstash: reinstall logstash1001-1003 - https://phabricator.wikimedia.org/T97545#1245961 (10fgiunchedi) I've added {T104035} while we're at it re imagining servers [16:53:14] 6operations, 10ops-eqiad, 7Swift: ms-be2013 - swift-storage/sdc1 is not accessible: Input/output error - https://phabricator.wikimedia.org/T105213#1441988 (10Papaul) disk @ slot2 status led's are off. 
(not green not amber) [16:53:15] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [16:53:25] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused [16:53:41] !log bounced cassandra on restbase1004 [16:55:06] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [16:55:16] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.000 second response time on port 9042 [16:56:43] (03PS3) 10Gage: Add flag --all-projects to projectviews aggregator [puppet] - 10https://gerrit.wikimedia.org/r/223573 (https://phabricator.wikimedia.org/T95339) (owner: 10Mforns) [16:57:27] (03PS1) 10Yuvipanda: ldap: Allow projects to override user's loginshells [puppet] - 10https://gerrit.wikimedia.org/r/223828 (https://phabricator.wikimedia.org/T102395) [16:57:43] (03CR) 10Gage: [C: 032] Add flag --all-projects to projectviews aggregator [puppet] - 10https://gerrit.wikimedia.org/r/223573 (https://phabricator.wikimedia.org/T95339) (owner: 10Mforns) [16:57:45] bd808: FYI: logstash1003 has a bad disk [16:58:21] _joe_: If HHVM is coming to imagescalers, does that mean we have a final fix for https://phabricator.wikimedia.org/T91468? [16:58:23] (03CR) 10Dzahn: [C: 04-1] "this check command will translate to a check_http command using the -S switch for HTTPS, which will lead to:" [puppet] - 10https://gerrit.wikimedia.org/r/223364 (https://phabricator.wikimedia.org/T104948) (owner: 10John F. Lewis) [16:58:32] cmjohnson1: one of the seagates or the main disk? [16:58:49] seagates [16:58:57] *nod* [16:59:41] Does it really take 24h to wipe those 3T disks? [17:00:04] RobH mutante: Respected human, time to deploy Planet Aggregator SSL update (misc-web) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150709T1700). Please do the needful. [17:00:09] (03CR) 10Merlijn van Deen: [C: 04-1] "the general behavior looks good to me" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/222342 (owner: 10Yuvipanda) [17:00:22] heh, forgot to pull that off deployment page. [17:00:30] bd808: 24h sounds almost low :) [17:00:50] oh, jouncebot talks to me, heh [17:00:55] well, i'm here [17:01:03] mutante: but brandon merged for us the other day ;D [17:01:09] so we dont have to do shit, its done! [17:01:09] robh: even better:) [17:01:12] nice [17:01:12] \o/ [17:01:16] (03Abandoned) 10Yuvipanda: labstore: Rewrite sync-exports to python [puppet] - 10https://gerrit.wikimedia.org/r/222342 (owner: 10Yuvipanda) [17:01:25] was fixed when he was fixing the root certificate preference issue [17:01:32] well, immediately afterward [17:01:37] cool:) [17:01:44] bd808 it could take longer for 3TB it makes several passes [17:01:45] jouncebot: next [17:01:46] In 0 hour(s) and 58 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150709T1800) [17:05:16] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 1 failures [17:05:34] (03PS3) 10Dzahn: static bugzilla: add http monitoring [puppet] - 10https://gerrit.wikimedia.org/r/223364 (https://phabricator.wikimedia.org/T104948) (owner: 10John F. Lewis) [17:06:36] (03PS4) 10Dzahn: static bugzilla: add http monitoring [puppet] - 10https://gerrit.wikimedia.org/r/223364 (https://phabricator.wikimedia.org/T104948) (owner: 10John F. 
Lewis) [17:07:13] (03PS5) 10Dzahn: static bugzilla: add http monitoring [puppet] - 10https://gerrit.wikimedia.org/r/223364 (https://phabricator.wikimedia.org/T104948) (owner: 10John F. Lewis) [17:07:20] (03CR) 10Dzahn: [C: 032] static bugzilla: add http monitoring [puppet] - 10https://gerrit.wikimedia.org/r/223364 (https://phabricator.wikimedia.org/T104948) (owner: 10John F. Lewis) [17:09:00] (03PS1) 10Yuvipanda: labstore: Excape grants *properly* [puppet] - 10https://gerrit.wikimedia.org/r/223830 [17:09:10] \*properly\* you mean [17:09:15] Coren: ^ [17:09:40] <_joe_> ostriches: since some time in fact [17:10:15] RECOVERY - check_puppetrun on boron is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:10:47] (03CR) 10coren: [C: 032] "Gotta love quoting quoted strings, no matter the language." [puppet] - 10https://gerrit.wikimedia.org/r/223830 (owner: 10Yuvipanda) [17:10:52] _joe_: btw by migrating image scalers i didn't mean we should migrate them to oman :P [17:11:54] <_joe_> heh [17:12:47] 6operations, 7Icinga, 5Patch-For-Review: monitor HTTP on bromine.eqiad.wmnet - https://phabricator.wikimedia.org/T104948#1442084 (10Dzahn) Should we add a separate check for each virtual host in their respective role? I tend to say yes because that makes moving roles around flexible and every service has the... [17:15:25] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Ferm rules for mailman - https://phabricator.wikimedia.org/T104980#1442108 (10Dzahn) merged. this did not do anything on sodium, but as soon as spin up a VM or new server to replace it and add base::firewall, it will be applied there. then we can do... [17:17:20] (03CR) 10Dzahn: [C: 04-1] "NFS" [puppet] - 10https://gerrit.wikimedia.org/r/205903 (https://phabricator.wikimedia.org/T104939) (owner: 10Dzahn) [17:17:54] (03CR) 10Dzahn: "we want fixed port assignments - see ticket" [puppet] - 10https://gerrit.wikimedia.org/r/205904 (https://phabricator.wikimedia.org/T104939) (owner: 10Dzahn) [17:18:04] 6operations, 10ops-eqiad, 7Swift: ms-be2013 - swift-storage/sdc1 is not accessible: Input/output error - https://phabricator.wikimedia.org/T105213#1442116 (10fgiunchedi) a:5Papaul>3fgiunchedi thanks @papaul, I made a brown paperbag mistake and cleared the raid config (not the foreign config) on reboot. I... 
[17:19:06] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [17:20:36] 6operations, 10ops-eqiad, 7Swift: ms-be2013 - swift-storage/sdc1 is not accessible: Input/output error - https://phabricator.wikimedia.org/T105213#1442120 (10Dzahn) p:5Triage>3Normal [17:21:06] RECOVERY - Host mw2027 is UPING OK - Packet loss = 0%, RTA = 43.12 ms [17:22:07] (03PS2) 10Dzahn: tmh (videoscaler): add base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/223244 (https://phabricator.wikimedia.org/T104970) [17:22:09] 6operations, 10ops-codfw, 7Swift: ms-be2013 - swift-storage/sdc1 is not accessible: Input/output error - https://phabricator.wikimedia.org/T105213#1442123 (10fgiunchedi) [17:22:35] !log shutting down helium for a few minutes to move within the same row [17:22:40] Logged the message, Master [17:23:46] PROBLEM - Restbase root url on restbase1004 is CRITICAL - Socket timeout after 10 seconds [17:25:10] (03CR) 10coren: labstore: Escape _s properly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/223780 (owner: 10Yuvipanda) [17:26:55] PROBLEM - Apache HTTP on mw1116 is CRITICAL - Socket timeout after 10 seconds [17:26:56] PROBLEM - Apache HTTP on mw1132 is CRITICAL - Socket timeout after 10 seconds [17:26:56] PROBLEM - Apache HTTP on mw1131 is CRITICAL - Socket timeout after 10 seconds [17:26:56] PROBLEM - Apache HTTP on mw1137 is CRITICAL - Socket timeout after 10 seconds [17:26:56] PROBLEM - Apache HTTP on mw1135 is CRITICAL - Socket timeout after 10 seconds [17:26:56] PROBLEM - Apache HTTP on mw1122 is CRITICAL - Socket timeout after 10 seconds [17:26:56] PROBLEM - Apache HTTP on mw1143 is CRITICAL - Socket timeout after 10 seconds [17:26:57] PROBLEM - Apache HTTP on mw1138 is CRITICAL - Socket timeout after 10 seconds [17:26:57] PROBLEM - HHVM rendering on mw1135 is CRITICAL - Socket timeout after 10 seconds [17:27:05] PROBLEM - HHVM rendering on mw1131 is CRITICAL - Socket timeout after 10 seconds [17:27:05] PROBLEM - HHVM rendering on mw1123 is CRITICAL - Socket timeout after 10 seconds [17:27:05] PROBLEM - HHVM rendering on mw1142 is CRITICAL - Socket timeout after 10 seconds [17:27:05] PROBLEM - Apache HTTP on mw1129 is CRITICAL - Socket timeout after 10 seconds [17:27:05] PROBLEM - HHVM rendering on mw1148 is CRITICAL - Socket timeout after 10 seconds [17:27:06] PROBLEM - Apache HTTP on mw1123 is CRITICAL - Socket timeout after 10 seconds [17:27:06] PROBLEM - Apache HTTP on mw1142 is CRITICAL - Socket timeout after 10 seconds [17:27:06] PROBLEM - Host helium is DOWN: PING CRITICAL - Packet loss = 100% [17:27:09] ? [17:27:10] <_joe_> what? 
[17:27:16] PROBLEM - Apache HTTP on mw1232 is CRITICAL - Socket timeout after 10 seconds [17:27:16] PROBLEM - HHVM rendering on mw1132 is CRITICAL - Socket timeout after 10 seconds [17:27:16] PROBLEM - Apache HTTP on mw1115 is CRITICAL - Socket timeout after 10 seconds [17:27:16] PROBLEM - HHVM rendering on mw1117 is CRITICAL - Socket timeout after 10 seconds [17:27:16] PROBLEM - Apache HTTP on mw1193 is CRITICAL - Socket timeout after 10 seconds [17:27:17] PROBLEM - Apache HTTP on mw1130 is CRITICAL - Socket timeout after 10 seconds [17:27:17] PROBLEM - Apache HTTP on mw1126 is CRITICAL - Socket timeout after 10 seconds [17:27:18] eh, helium is poolcounter [17:27:24] <_joe_> fuck [17:27:25] and chris just logged he is moving it around [17:27:25] PROBLEM - Apache HTTP on mw1144 is CRITICAL - Socket timeout after 10 seconds [17:27:26] PROBLEM - HHVM rendering on mw1138 is CRITICAL - Socket timeout after 10 seconds [17:27:26] PROBLEM - HHVM rendering on mw1146 is CRITICAL - Socket timeout after 10 seconds [17:27:26] PROBLEM - HHVM rendering on mw1119 is CRITICAL - Socket timeout after 10 seconds [17:27:26] PROBLEM - HHVM rendering on mw1116 is CRITICAL - Socket timeout after 10 seconds [17:27:26] RECOVERY - Restbase root url on restbase1004 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.016 second response time [17:27:27] PROBLEM - HHVM rendering on mw1127 is CRITICAL - Socket timeout after 10 seconds [17:27:27] PROBLEM - HHVM rendering on mw1145 is CRITICAL - Socket timeout after 10 seconds [17:27:32] <_joe_> mutante: what? [17:27:35] PROBLEM - Apache HTTP on mw1127 is CRITICAL - Socket timeout after 10 seconds [17:27:35] PROBLEM - Apache HTTP on mw1119 is CRITICAL - Socket timeout after 10 seconds [17:27:35] PROBLEM - Apache HTTP on mw1120 is CRITICAL - Socket timeout after 10 seconds [17:27:35] PROBLEM - HHVM rendering on mw1194 is CRITICAL - Socket timeout after 10 seconds [17:27:35] PROBLEM - HHVM rendering on mw1137 is CRITICAL - Socket timeout after 10 seconds [17:27:36] PROBLEM - Apache HTTP on mw1136 is CRITICAL - Socket timeout after 10 seconds [17:27:36] PROBLEM - Apache HTTP on mw1114 is CRITICAL - Socket timeout after 10 seconds [17:27:38] single point of failure, ftw [17:27:39] <_joe_> its' gonna kill all production [17:27:43] I was about to mention something about my timing out when I try to render pages not cached. 
[17:27:46] PROBLEM - HHVM rendering on mw1143 is CRITICAL - Socket timeout after 10 seconds [17:27:46] PROBLEM - HHVM rendering on mw1115 is CRITICAL - Socket timeout after 10 seconds [17:27:46] PROBLEM - HHVM rendering on mw1129 is CRITICAL - Socket timeout after 10 seconds [17:27:46] PROBLEM - Apache HTTP on mw1133 is CRITICAL - Socket timeout after 10 seconds [17:27:52] Everything's down [17:27:55] PROBLEM - Apache HTTP on mw1134 is CRITICAL - Socket timeout after 10 seconds [17:27:55] PROBLEM - Apache HTTP on mw1194 is CRITICAL - Socket timeout after 10 seconds [17:27:56] PROBLEM - Apache HTTP on mw1121 is CRITICAL - Socket timeout after 10 seconds [17:27:56] PROBLEM - Apache HTTP on mw1148 is CRITICAL - Socket timeout after 10 seconds [17:27:56] PROBLEM - Apache HTTP on mw1125 is CRITICAL - Socket timeout after 10 seconds [17:27:56] PROBLEM - Apache HTTP on mw1140 is CRITICAL - Socket timeout after 10 seconds [17:27:59] 10:23 < cmjohnson1> !log shutting down helium for a few minutes to move within the same row [17:28:03] <_joe_> soemone call chris [17:28:05] PROBLEM - HHVM rendering on mw1124 is CRITICAL - Socket timeout after 10 seconds [17:28:05] PROBLEM - HHVM rendering on mw1139 is CRITICAL - Socket timeout after 10 seconds [17:28:06] PROBLEM - Apache HTTP on mw1146 is CRITICAL - Socket timeout after 10 seconds [17:28:10] ok [17:28:14] (03CR) 10Jcrespo: "Comment answering some of the questions." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/223780 (owner: 10Yuvipanda) [17:28:15] PROBLEM - HHVM rendering on mw1136 is CRITICAL - Socket timeout after 10 seconds [17:28:16] PROBLEM - HHVM rendering on mw1126 is CRITICAL - Socket timeout after 10 seconds [17:28:16] PROBLEM - HHVM rendering on mw1122 is CRITICAL - Socket timeout after 10 seconds [17:28:16] PROBLEM - HHVM rendering on mw1130 is CRITICAL - Socket timeout after 10 seconds [17:28:16] PROBLEM - HHVM rendering on mw1121 is CRITICAL - Socket timeout after 10 seconds [17:28:16] PROBLEM - HHVM rendering on mw1193 is CRITICAL - Socket timeout after 10 seconds [17:28:16] cmjohnson1: [17:28:16] PROBLEM - HHVM rendering on mw1128 is CRITICAL - Socket timeout after 10 seconds [17:28:17] PROBLEM - HHVM rendering on mw1147 is CRITICAL - Socket timeout after 10 seconds [17:28:27] PROBLEM - HHVM rendering on mw1232 is CRITICAL - Socket timeout after 10 seconds [17:28:28] PROBLEM - HHVM rendering on mw1152 is CRITICAL - Socket timeout after 10 seconds [17:28:28] PROBLEM - HHVM rendering on mw1114 is CRITICAL - Socket timeout after 10 seconds [17:28:28] PROBLEM - HHVM rendering on mw1133 is CRITICAL - Socket timeout after 10 seconds [17:28:28] PROBLEM - HHVM rendering on mw1144 is CRITICAL - Socket timeout after 10 seconds [17:28:28] PROBLEM - Apache HTTP on mw1145 is CRITICAL - Socket timeout after 10 seconds [17:28:28] PROBLEM - HHVM rendering on mw1125 is CRITICAL - Socket timeout after 10 seconds [17:28:35] PROBLEM - Apache HTTP on mw1117 is CRITICAL - Socket timeout after 10 seconds [17:28:35] PROBLEM - Apache HTTP on mw1147 is CRITICAL - Socket timeout after 10 seconds [17:28:36] PROBLEM - Apache HTTP on mw1128 is CRITICAL - Socket timeout after 10 seconds [17:28:36] PROBLEM - Apache HTTP on mw1139 is CRITICAL - Socket timeout after 10 seconds [17:28:36] PROBLEM - Apache HTTP on mw1199 is CRITICAL - Socket timeout after 10 seconds [17:28:36] PROBLEM - HHVM rendering on mw1120 is CRITICAL - Socket timeout after 10 seconds [17:28:36] PROBLEM - HHVM rendering on mw1134 is CRITICAL - Socket timeout after 10 seconds [17:28:37] 
PROBLEM - HHVM rendering on mw1199 is CRITICAL - Socket timeout after 10 seconds [17:28:37] PROBLEM - HHVM rendering on mw1140 is CRITICAL - Socket timeout after 10 seconds [17:28:38] PROBLEM - HHVM rendering on mw1153 is CRITICAL - Socket timeout after 10 seconds [17:28:49] <_joe_> call as in call by telephone [17:29:06] <_joe_> ori: can we disable poolcounter? [17:29:09] i am calling [17:29:24] <_joe_> I guess not easily [17:29:35] we can [17:29:36] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 28.57% of data above the critical threshold [500.0] [17:29:45] is it 10.64.0.179 ? [17:30:07] RECOVERY - HHVM rendering on mw1152 is OK: HTTP OK: HTTP/1.1 200 OK - 65026 bytes in 0.143 second response time [17:30:11] <_joe_> hoo|busy: we should just move it to potassium? [17:30:17] hoo|busy: yes [17:30:24] 6operations, 6Services: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#1442170 (10cscott) It's possible that nobody knows how to decommission a host. I read through the source (`lib/threads/frontend.js`) and it looked to me like the only way ocg1003 should be pulling jobs is if someone is hi... [17:30:26] PROBLEM - LVS HTTP IPv4 on api.svc.eqiad.wmnet is CRITICAL - Socket timeout after 10 seconds [17:30:32] (03PS1) 10Hoo man: helium is down [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223837 [17:30:33] RECOVERY - HHVM rendering on mw1153 is OK: HTTP OK: HTTP/1.1 200 OK - 65026 bytes in 0.127 second response time [17:30:35] PROBLEM - Apache HTTP on mw1124 is CRITICAL - Socket timeout after 10 seconds [17:30:37] (03PS1) 10Ori.livneh: make mw1154 poolcounter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223838 [17:30:46] (03CR) 10Ori.livneh: [C: 032 V: 032] make mw1154 poolcounter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223838 (owner: 10Ori.livneh) [17:30:48] <_joe_> ori: we have potassium [17:30:49] <_joe_> wait [17:30:55] i just got voice mail. did leave message [17:30:59] shall I? [17:31:08] or well, ori's faster [17:31:15] PROBLEM - HHVM rendering on mw1227 is CRITICAL - Socket timeout after 10 seconds [17:31:15] PROBLEM - Apache HTTP on mw1189 is CRITICAL - Socket timeout after 10 seconds [17:31:15] <_joe_> ori: potassium is a poolcounter [17:31:16] PROBLEM - Apache HTTP on mw1235 is CRITICAL - Socket timeout after 10 seconds [17:31:16] PROBLEM - HHVM rendering on mw1233 is CRITICAL - Socket timeout after 10 seconds [17:31:16] PROBLEM - Apache HTTP on mw1202 is CRITICAL - Socket timeout after 10 seconds [17:31:16] PROBLEM - HHVM rendering on mw1222 is CRITICAL - Socket timeout after 10 seconds [17:31:19] <_joe_> so let's us that [17:31:25] PROBLEM - Apache HTTP on mw1201 is CRITICAL - Socket timeout after 10 seconds [17:31:25] PROBLEM - HHVM rendering on mw1224 is CRITICAL - Socket timeout after 10 seconds [17:31:26] PROBLEM - Apache HTTP on mw1223 is CRITICAL - Socket timeout after 10 seconds [17:31:26] PROBLEM - HHVM rendering on mw1203 is CRITICAL - Socket timeout after 10 seconds [17:31:26] PROBLEM - Apache HTTP on mw1228 is CRITICAL - Socket timeout after 10 seconds [17:31:26] PROBLEM - Apache HTTP on mw1221 is CRITICAL - Socket timeout after 10 seconds [17:31:36] PROBLEM - HHVM rendering on mw1229 is CRITICAL - Socket timeout after 10 seconds [17:31:36] PROBLEM - HHVM rendering on mw1190 is CRITICAL - Socket timeout after 10 seconds [17:31:36] PROBLEM - HHVM rendering on mw1189 is CRITICAL - Socket timeout after 10 seconds [17:31:36] PROBLEM - HHVM rendering on mw1198 is CRITICAL - Socket timeout after 10 seconds [17:31:42] ? 
[17:31:43] !log ori Synchronized wmf-config/PoolCounterSettings-eqiad.php: (no message) (duration: 00m 12s) [17:31:45] PROBLEM - HHVM busy threads on mw1123 is CRITICAL 66.67% of data above the critical threshold [86.4] [17:31:45] PROBLEM - Apache HTTP on mw1203 is CRITICAL - Socket timeout after 10 seconds [17:31:45] PROBLEM - Apache HTTP on mw1225 is CRITICAL - Socket timeout after 10 seconds [17:31:45] PROBLEM - Apache HTTP on mw1198 is CRITICAL - Socket timeout after 10 seconds [17:31:46] PROBLEM - Apache HTTP on mw1222 is CRITICAL - Socket timeout after 10 seconds [17:31:46] PROBLEM - HHVM busy threads on mw1126 is CRITICAL 66.67% of data above the critical threshold [86.4] [17:31:46] PROBLEM - HHVM busy threads on mw1133 is CRITICAL 66.67% of data above the critical threshold [86.4] [17:31:47] PROBLEM - Apache HTTP on mw1200 is CRITICAL - Socket timeout after 10 seconds [17:31:47] Logged the message, Master [17:31:56] PROBLEM - HHVM rendering on mw1208 is CRITICAL - Socket timeout after 10 seconds [17:31:56] PROBLEM - HHVM rendering on mw1197 is CRITICAL - Socket timeout after 10 seconds [17:31:57] PROBLEM - Apache HTTP on mw1227 is CRITICAL - Socket timeout after 10 seconds [17:31:57] PROBLEM - Apache HTTP on mw1192 is CRITICAL - Socket timeout after 10 seconds [17:32:05] PROBLEM - Apache HTTP on mw1190 is CRITICAL - Socket timeout after 10 seconds [17:32:06] PROBLEM - Apache HTTP on mw1231 is CRITICAL - Socket timeout after 10 seconds [17:32:06] PROBLEM - HHVM rendering on mw1223 is CRITICAL - Socket timeout after 10 seconds [17:32:06] PROBLEM - HHVM rendering on mw1230 is CRITICAL - Socket timeout after 10 seconds [17:32:06] PROBLEM - HHVM busy threads on mw1120 is CRITICAL 75.00% of data above the critical threshold [86.4] [17:32:12] !log installed poolcounter on mw1154 [17:32:15] PROBLEM - HHVM busy threads on mw1136 is CRITICAL 75.00% of data above the critical threshold [86.4] [17:32:15] PROBLEM - Apache HTTP on mw1208 is CRITICAL - Socket timeout after 10 seconds [17:32:16] PROBLEM - HHVM rendering on mw1201 is CRITICAL - Socket timeout after 10 seconds [17:32:16] Logged the message, Master [17:32:19] are we recovering? [17:32:24] Why set up a new one? [17:32:25] PROBLEM - Apache HTTP on mw1224 is CRITICAL - Socket timeout after 10 seconds [17:32:26] PROBLEM - Apache HTTP on mw1234 is CRITICAL - Socket timeout after 10 seconds [17:32:26] PROBLEM - Apache HTTP on mw1226 is CRITICAL - Socket timeout after 10 seconds [17:32:28] nope [17:32:35] PROBLEM - HHVM busy threads on mw1117 is CRITICAL 75.00% of data above the critical threshold [86.4] [17:32:35] PROBLEM - HHVM rendering on mw1235 is CRITICAL - Socket timeout after 10 seconds [17:32:35] PROBLEM - HHVM rendering on mw1195 is CRITICAL - Socket timeout after 10 seconds [17:32:36] PROBLEM - HHVM busy threads on mw1140 is CRITICAL 75.00% of data above the critical threshold [86.4] [17:32:37] site's up for me [17:32:37] Does it have base::firewall? 
[17:32:45] not for me [17:32:45] PROBLEM - HHVM queue size on mw1148 is CRITICAL 75.00% of data above the critical threshold [80.0] [17:32:45] PROBLEM - Apache HTTP on mw1197 is CRITICAL - Socket timeout after 10 seconds [17:32:46] PROBLEM - HHVM queue size on mw1143 is CRITICAL 66.67% of data above the critical threshold [80.0] [17:32:46] PROBLEM - HHVM rendering on mw1231 is CRITICAL - Socket timeout after 10 seconds [17:32:46] PROBLEM - HHVM rendering on mw1226 is CRITICAL - Socket timeout after 10 seconds [17:32:47] PROBLEM - HHVM rendering on mw1225 is CRITICAL - Socket timeout after 10 seconds [17:32:47] PROBLEM - HHVM rendering on mw1221 is CRITICAL - Socket timeout after 10 seconds [17:32:47] PROBLEM - HHVM rendering on mw1200 is CRITICAL - Socket timeout after 10 seconds [17:32:47] PROBLEM - HHVM rendering on mw1192 is CRITICAL - Socket timeout after 10 seconds [17:32:47] <_joe_> ori: remove the current poolcounter from the list [17:32:48] ori: Try an uncached page [17:32:56] PROBLEM - HHVM queue size on mw1116 is CRITICAL 77.78% of data above the critical threshold [80.0] [17:32:56] PROBLEM - HHVM rendering on mw1234 is CRITICAL - Socket timeout after 10 seconds [17:32:57] PROBLEM - HHVM rendering on mw1202 is CRITICAL - Socket timeout after 10 seconds [17:32:57] PROBLEM - HHVM busy threads on mw1135 is CRITICAL 75.00% of data above the critical threshold [86.4] [17:32:57] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 65020 bytes in 3.366 second response time [17:32:58] PROBLEM - HHVM busy threads on mw1194 is CRITICAL 75.00% of data above the critical threshold [115.2] [17:32:58] still up [17:33:02] hm [17:33:05] PROBLEM - HHVM queue size on mw1117 is CRITICAL 77.78% of data above the critical threshold [80.0] [17:33:05] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.090 second response time [17:33:06] RECOVERY - Apache HTTP on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.885 second response time [17:33:06] PROBLEM - HHVM busy threads on mw1130 is CRITICAL 87.50% of data above the critical threshold [86.4] [17:33:06] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 65020 bytes in 6.767 second response time [17:33:06] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.034 second response time [17:33:06] PROBLEM - HHVM busy threads on mw1146 is CRITICAL 87.50% of data above the critical threshold [86.4] [17:33:06] RECOVERY - HHVM rendering on mw1138 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 3.153 second response time [17:33:07] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 65020 bytes in 3.579 second response time [17:33:15] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.998 second response time [17:33:15] RECOVERY - HHVM rendering on mw1119 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 4.332 second response time [17:33:15] RECOVERY - Apache HTTP on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.589 second response time [17:33:15] RECOVERY - HHVM rendering on mw1116 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 5.247 second response time [17:33:15] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.344 second response time [17:33:16] RECOVERY - HHVM rendering on mw1146 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 6.095 second response time [17:33:16] RECOVERY 
- HHVM rendering on mw1145 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 3.170 second response time [17:33:17] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.462 second response time [17:33:17] RECOVERY - HHVM rendering on mw1127 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 5.116 second response time [17:33:18] PROBLEM - HHVM busy threads on mw1144 is CRITICAL 87.50% of data above the critical threshold [86.4] [17:33:18] PROBLEM - HHVM busy threads on mw1145 is CRITICAL 85.71% of data above the critical threshold [86.4] [17:33:19] PROBLEM - HHVM busy threads on mw1124 is CRITICAL 87.50% of data above the critical threshold [86.4] [17:33:19] PROBLEM - HHVM busy threads on mw1128 is CRITICAL 85.71% of data above the critical threshold [86.4] [17:33:19] ok, looks back [17:33:20] PROBLEM - HHVM queue size on mw1139 is CRITICAL 75.00% of data above the critical threshold [80.0] [17:33:22] <_joe_> yes we're back [17:33:36] RECOVERY - Apache HTTP on mw1222 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.641 second response time [17:33:36] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.959 second response time [17:33:37] PROBLEM - HHVM busy threads on mw1121 is CRITICAL 77.78% of data above the critical threshold [86.4] [17:33:37] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 65020 bytes in 1.059 second response time [17:33:45] PROBLEM - HHVM busy threads on mw1232 is CRITICAL 85.71% of data above the critical threshold [115.2] [17:33:46] RECOVERY - Apache HTTP on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.246 second response time [17:33:46] RECOVERY - HHVM rendering on mw1223 is OK: HTTP OK: HTTP/1.1 200 OK - 65020 bytes in 0.113 second response time [17:33:47] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.117 second response time [17:33:52] a single box down and whole site is down ? 
[17:33:56] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 65020 bytes in 0.093 second response time [17:33:56] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.633 second response time [17:33:56] RECOVERY - HHVM rendering on mw1128 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 0.130 second response time [17:33:56] PROBLEM - HHVM queue size on mw1128 is CRITICAL 75.00% of data above the critical threshold [80.0] [17:33:56] PROBLEM - HHVM queue size on mw1130 is CRITICAL 77.78% of data above the critical threshold [80.0] [17:34:00] (03Abandoned) 10Hoo man: helium is down [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223837 (owner: 10Hoo man) [17:34:05] we need to fix that SPOF [17:34:05] RECOVERY - HHVM rendering on mw1147 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 8.273 second response time [17:34:06] PROBLEM - HHVM queue size on mw1126 is CRITICAL 77.78% of data above the critical threshold [80.0] [17:34:06] RECOVERY - HHVM rendering on mw1201 is OK: HTTP OK: HTTP/1.1 200 OK - 65020 bytes in 0.091 second response time [17:34:06] RECOVERY - HHVM rendering on mw1125 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 0.132 second response time [17:34:06] RECOVERY - Apache HTTP on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.041 second response time [17:34:06] RECOVERY - HHVM rendering on mw1144 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 1.801 second response time [17:34:06] PROBLEM - HHVM queue size on mw1144 is CRITICAL 85.71% of data above the critical threshold [80.0] [17:34:07] PROBLEM - HHVM busy threads on mw1125 is CRITICAL 100.00% of data above the critical threshold [86.4] [17:34:07] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.419 second response time [17:34:08] PROBLEM - HHVM queue size on mw1232 is CRITICAL 85.71% of data above the critical threshold [80.0] [17:34:08] PROBLEM - HHVM busy threads on mw1134 is CRITICAL 100.00% of data above the critical threshold [86.4] [17:34:09] PROBLEM - HHVM busy threads on mw1231 is CRITICAL 75.00% of data above the critical threshold [115.2] [17:34:09] PROBLEM - HHVM queue size on mw1193 is CRITICAL 87.50% of data above the critical threshold [80.0] [17:34:10] PROBLEM - HHVM busy threads on mw1148 is CRITICAL 100.00% of data above the critical threshold [86.4] [17:34:20] down again?! 
[17:34:22] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.039 second response time [17:34:22] PROBLEM - HHVM queue size on mw1199 is CRITICAL 100.00% of data above the critical threshold [80.0] [17:34:22] PROBLEM - HHVM busy threads on mw1138 is CRITICAL 100.00% of data above the critical threshold [86.4] [17:34:23] RECOVERY - HHVM rendering on mw1120 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 0.138 second response time [17:34:23] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.631 second response time [17:34:23] RECOVERY - HHVM rendering on mw1235 is OK: HTTP OK: HTTP/1.1 200 OK - 65020 bytes in 0.114 second response time [17:34:34] well, flaky [17:34:35] PROBLEM - HHVM busy threads on mw1129 is CRITICAL 87.50% of data above the critical threshold [86.4] [17:34:35] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.041 second response time [17:34:35] PROBLEM - HHVM busy threads on mw1143 is CRITICAL 100.00% of data above the critical threshold [86.4] [17:34:36] PROBLEM - HHVM busy threads on mw1131 is CRITICAL 100.00% of data above the critical threshold [86.4] [17:34:36] PROBLEM - HHVM queue size on mw1142 is CRITICAL 85.71% of data above the critical threshold [80.0] [17:34:37] RECOVERY - HHVM rendering on mw1221 is OK: HTTP OK: HTTP/1.1 200 OK - 65020 bytes in 0.108 second response time [17:34:37] RECOVERY - HHVM rendering on mw1123 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 0.138 second response time [17:34:37] PROBLEM - HHVM busy threads on mw1225 is CRITICAL 75.00% of data above the critical threshold [115.2] [17:34:37] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.593 second response time [17:34:38] RECOVERY - HHVM rendering on mw1142 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 0.197 second response time [17:34:38] PROBLEM - HHVM queue size on mw1225 is CRITICAL 44.44% of data above the critical threshold [80.0] [17:34:39] PROBLEM - HHVM busy threads on mw1132 is CRITICAL 100.00% of data above the critical threshold [86.4] [17:34:45] PROBLEM - Apache HTTP on mw1195 is CRITICAL - Socket timeout after 10 seconds [17:34:45] PROBLEM - Apache HTTP on mw1233 is CRITICAL - Socket timeout after 10 seconds [17:34:45] RECOVERY - HHVM rendering on mw1202 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 0.123 second response time [17:34:46] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.183 second response time [17:34:46] PROBLEM - HHVM queue size on mw1192 is CRITICAL 37.50% of data above the critical threshold [80.0] [17:34:46] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.061 second response time [17:34:47] PROBLEM - HHVM busy threads on mw1115 is CRITICAL 100.00% of data above the critical threshold [86.4] [17:34:47] PROBLEM - HHVM rendering on mw1228 is CRITICAL - Socket timeout after 10 seconds [17:34:47] PROBLEM - HHVM queue size on mw1119 is CRITICAL 87.50% of data above the critical threshold [80.0] [17:34:48] PROBLEM - HHVM busy threads on mw1233 is CRITICAL 75.00% of data above the critical threshold [115.2] [17:34:50] <_joe_> the site goes down because 'timeout' => 0.5 [17:35:01] :S [17:35:04] :-( [17:35:05] PROBLEM - HHVM busy threads on mw1203 is CRITICAL 77.78% of data above the critical threshold [115.2] [17:35:05] PROBLEM - HHVM queue size on mw1233 is CRITICAL 37.50% 
of data above the critical threshold [80.0] [17:35:05] PROBLEM - HHVM busy threads on mw1114 is CRITICAL 100.00% of data above the critical threshold [86.4] [17:35:05] PROBLEM - HHVM busy threads on mw1198 is CRITICAL 77.78% of data above the critical threshold [115.2] [17:35:05] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.793 second response time [17:35:05] RECOVERY - Apache HTTP on mw1232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.106 second response time [17:35:06] PROBLEM - HHVM queue size on mw1137 is CRITICAL 87.50% of data above the critical threshold [80.0] [17:35:06] PROBLEM - HHVM busy threads on mw1192 is CRITICAL 75.00% of data above the critical threshold [115.2] [17:35:14] We should have removed it from the list [17:35:15] PROBLEM - HHVM busy threads on mw1189 is CRITICAL 71.43% of data above the critical threshold [115.2] [17:35:16] RECOVERY - Apache HTTP on mw1221 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.024 second response time [17:35:16] RECOVERY - Apache HTTP on mw1228 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.037 second response time [17:35:16] PROBLEM - HHVM busy threads on mw1147 is CRITICAL 100.00% of data above the critical threshold [86.4] [17:35:16] RECOVERY - HHVM rendering on mw1194 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 0.737 second response time [17:35:16] RECOVERY - HHVM rendering on mw1137 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 1.000 second response time [17:35:16] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.038 second response time [17:35:16] 6operations, 6Services: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#930642 (10cscott) Filed T105372 to implement clean shut down. In the interm, just `service ocg stop` and we'll hope the affected user whose render job hangs doesn't get too upset with us. 
[17:35:17] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.077 second response time [17:35:17] PROBLEM - HHVM queue size on mw1145 is CRITICAL 100.00% of data above the critical threshold [80.0] [17:35:18] PROBLEM - HHVM busy threads on mw1229 is CRITICAL 71.43% of data above the critical threshold [115.2] [17:35:18] PROBLEM - HHVM busy threads on mw1190 is CRITICAL 75.00% of data above the critical threshold [115.2] [17:35:25] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.728 second response time [17:35:25] RECOVERY - HHVM rendering on mw1190 is OK: HTTP OK: HTTP/1.1 200 OK - 65020 bytes in 0.113 second response time [17:35:26] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 0.146 second response time [17:35:26] RECOVERY - HHVM rendering on mw1115 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 0.150 second response time [17:35:26] RECOVERY - HHVM rendering on mw1143 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 0.184 second response time [17:35:26] PROBLEM - HHVM queue size on mw1194 is CRITICAL 100.00% of data above the critical threshold [80.0] [17:35:26] PROBLEM - HHVM busy threads on mw1227 is CRITICAL 66.67% of data above the critical threshold [115.2] [17:35:27] PROBLEM - HHVM busy threads on mw1202 is CRITICAL 75.00% of data above the critical threshold [115.2] [17:35:27] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.541 second response time [17:35:28] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.030 second response time [17:35:28] RECOVERY - Apache HTTP on mw1203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.043 second response time [17:35:29] just wait [17:35:29] RECOVERY - Apache HTTP on mw1225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.316 second response time [17:35:30] PROBLEM - HHVM queue size on mw1189 is CRITICAL 42.86% of data above the critical threshold [80.0] [17:35:33] helium is powering up [17:35:39] (03PS1) 10Giuseppe Lavagetto: poolcounter: just comment out helium, leave potassium [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223840 [17:35:45] PROBLEM - HHVM busy threads on mw1230 is CRITICAL 75.00% of data above the critical threshold [115.2] [17:35:45] PROBLEM - HHVM busy threads on mw1137 is CRITICAL 100.00% of data above the critical threshold [86.4] [17:35:45] RECOVERY - HHVM rendering on mw1124 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 0.148 second response time [17:35:45] PROBLEM - HHVM queue size on mw1121 is CRITICAL 77.78% of data above the critical threshold [80.0] [17:35:45] PROBLEM - HHVM queue size on mw1147 is CRITICAL 87.50% of data above the critical threshold [80.0] [17:35:46] <_joe_> ok then this ^ is not needed? 
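(For context on Gerrit change 223840 above, which comments helium out of the poolcounter pool: the MediaWiki PoolCounter client is configured through $wgPoolCountClientConf in mediawiki-config. The snippet below is only a hedged sketch of what such a change might look like; 'servers' and 'timeout' are the standard client keys, but the addresses and values shown here are placeholders, not the actual production file.)

```php
<?php
// Illustrative sketch, not the real PoolCounterSettings file: drop the dead
// daemon from the client's server list so requests stop blocking on it, and
// keep only the surviving host.
$wgPoolCountClientConf = array(
	'servers' => array(
		// 'x.x.x.x', // helium (down) - placeholder address
		'y.y.y.y',    // potassium - placeholder address
	),
	// The 0.5s client timeout mentioned above: with an unreachable-but-routed
	// host, every page view can spend up to this long waiting before giving up,
	// which is what backed up the HHVM queues.
	'timeout' => 0.5,
);
```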
[17:35:46] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.034 second response time [17:35:46] RECOVERY - HHVM rendering on mw1139 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 4.216 second response time [17:35:47] RECOVERY - Apache HTTP on mw1231 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.035 second response time [17:35:47] PROBLEM - HHVM busy threads on mw1193 is CRITICAL 100.00% of data above the critical threshold [115.2] [17:35:48] RECOVERY - HHVM rendering on mw1126 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 0.138 second response time [17:35:48] RECOVERY - HHVM rendering on mw1122 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 0.137 second response time [17:35:49] PROBLEM - HHVM busy threads on mw1226 is CRITICAL 75.00% of data above the critical threshold [115.2] [17:35:55] RECOVERY - HHVM rendering on mw1136 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 1.809 second response time [17:35:55] PROBLEM - HHVM busy threads on mw1142 is CRITICAL 100.00% of data above the critical threshold [86.4] [17:35:55] RECOVERY - HHVM rendering on mw1121 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 0.129 second response time [17:35:56] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 11 data above and 0 below the confidence bounds [17:35:56] RECOVERY - HHVM rendering on mw1193 is OK: HTTP OK: HTTP/1.1 200 OK - 65020 bytes in 0.102 second response time [17:35:56] PROBLEM - HHVM queue size on mw1122 is CRITICAL 87.50% of data above the critical threshold [80.0] [17:35:56] PROBLEM - HHVM queue size on mw1136 is CRITICAL 100.00% of data above the critical threshold [80.0] [17:35:56] PROBLEM - HHVM queue size on mw1115 is CRITICAL 87.50% of data above the critical threshold [80.0] [17:35:57] PROBLEM - HHVM busy threads on mw1235 is CRITICAL 75.00% of data above the critical threshold [115.2] [17:35:57] PROBLEM - HHVM queue size on mw1223 is CRITICAL 62.50% of data above the critical threshold [80.0] [17:36:05] yes, no more syncs, wait for it to recover [17:36:16] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.980 second response time [17:36:16] RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.032 second response time [17:36:16] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.999 second response time [17:36:17] PROBLEM - HHVM busy threads on mw1200 is CRITICAL 75.00% of data above the critical threshold [115.2] [17:36:17] PROBLEM - HHVM busy threads on mw1228 is CRITICAL 75.00% of data above the critical threshold [115.2] [17:36:17] RECOVERY - HHVM rendering on mw1195 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 0.118 second response time [17:36:17] RECOVERY - HHVM rendering on mw1199 is OK: HTTP OK: HTTP/1.1 200 OK - 65020 bytes in 0.109 second response time [17:36:17] RECOVERY - HHVM rendering on mw1134 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 0.139 second response time [17:36:17] RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 0.164 second response time [17:36:18] PROBLEM - HHVM queue size on mw1140 is CRITICAL 100.00% of data above the critical threshold [80.0] [17:36:22] <_joe_> ori: agreed [17:36:25] RECOVERY - Apache HTTP on mw1124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.042 second response time [17:36:25] PROBLEM - HHVM busy threads on mw1127 is CRITICAL 
100.00% of data above the critical threshold [86.4] [17:36:26] PROBLEM - HHVM queue size on mw1235 is CRITICAL 57.14% of data above the critical threshold [80.0] [17:36:26] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.040 second response time [17:36:26] RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.039 second response time [17:36:26] RECOVERY - Apache HTTP on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.033 second response time [17:36:26] PROBLEM - HHVM busy threads on mw1234 is CRITICAL 71.43% of data above the critical threshold [115.2] [17:36:27] PROBLEM - HHVM busy threads on mw1197 is CRITICAL 71.43% of data above the critical threshold [115.2] [17:36:27] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.040 second response time [17:36:28] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.038 second response time [17:36:28] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.042 second response time [17:36:29] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.050 second response time [17:36:29] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.153 second response time [17:36:30] PROBLEM - HHVM queue size on mw1123 is CRITICAL 100.00% of data above the critical threshold [80.0] [17:36:35] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 65020 bytes in 0.086 second response time [17:36:36] RECOVERY - HHVM rendering on mw1225 is OK: HTTP OK: HTTP/1.1 200 OK - 65020 bytes in 0.092 second response time [17:36:36] RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 65020 bytes in 0.094 second response time [17:36:36] RECOVERY - HHVM rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 65020 bytes in 0.104 second response time [17:36:36] RECOVERY - HHVM rendering on mw1135 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 0.216 second response time [17:36:36] RECOVERY - HHVM rendering on mw1200 is OK: HTTP OK: HTTP/1.1 200 OK - 65020 bytes in 0.116 second response time [17:36:56] PROBLEM - HHVM queue size on mw1129 is CRITICAL 77.78% of data above the critical threshold [80.0] [17:36:57] RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.226 second response time [17:36:57] RECOVERY - HHVM rendering on mw1117 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 5.678 second response time [17:36:57] RECOVERY - Host helium is UPING OK - Packet loss = 0%, RTA = 0.28 ms [17:37:11] <_joe_> ori: let's remove the additional ip though [17:37:15] PROBLEM - HHVM busy threads on mw1208 is CRITICAL 71.43% of data above the critical threshold [115.2] [17:37:15] PROBLEM - HHVM queue size on mw1127 is CRITICAL 75.00% of data above the critical threshold [80.0] [17:37:16] PROBLEM - HHVM queue size on mw1198 is CRITICAL 85.71% of data above the critical threshold [80.0] [17:37:16] PROBLEM - HHVM queue size on mw1203 is CRITICAL 42.86% of data above the critical threshold [80.0] [17:37:16] PROBLEM - HHVM busy threads on mw1223 is CRITICAL 85.71% of data above the critical threshold [115.2] [17:37:18] <_joe_> now that helium is up? [17:37:18] re: potassium, I didn't know about it. 
I just knew it was packaged / tested on precise and that mw1154 (scaler) was still on precise) [17:37:26] PROBLEM - HHVM busy threads on mw1241 is CRITICAL 42.86% of data above the critical threshold [115.2] [17:37:26] PROBLEM - HHVM queue size on mw1201 is CRITICAL 57.14% of data above the critical threshold [80.0] [17:37:32] is it confirmed up? [17:37:33] <_joe_> oh so you installed it ok [17:37:36] PROBLEM - HHVM queue size on mw1202 is CRITICAL 57.14% of data above the critical threshold [80.0] [17:37:36] PROBLEM - HHVM queue size on mw1197 is CRITICAL 55.56% of data above the critical threshold [80.0] [17:37:42] <_joe_> icinga-wm> RECOVERY - Host helium is UPING OK - Packet loss = 0%, RTA = 0.28 ms [17:37:43] _joe_: yeah i logged it, lost in the noise [17:37:45] PROBLEM - HHVM queue size on mw1230 is CRITICAL 57.14% of data above the critical threshold [80.0] [17:37:45] RECOVERY - HHVM rendering on mw1130 is OK: HTTP OK: HTTP/1.1 200 OK - 65021 bytes in 0.129 second response time [17:37:46] PROBLEM - HHVM busy threads on mw1191 is CRITICAL 85.71% of data above the critical threshold [115.2] [17:37:46] PROBLEM - HHVM queue size on mw1222 is CRITICAL 62.50% of data above the critical threshold [80.0] [17:37:46] PROBLEM - HHVM busy threads on mw1221 is CRITICAL 75.00% of data above the critical threshold [115.2] [17:37:50] <_joe_> yep sorry didn't see it [17:37:55] PROBLEM - HHVM busy threads on mw1201 is CRITICAL 85.71% of data above the critical threshold [115.2] [17:38:06] PROBLEM - HHVM queue size on mw1234 is CRITICAL 85.71% of data above the critical threshold [80.0] [17:38:08] <_joe_> busy threads and queue sizes are expected in this situation [17:38:25] PROBLEM - HHVM queue size on mw1228 is CRITICAL 85.71% of data above the critical threshold [80.0] [17:38:27] PROBLEM - HHVM queue size on mw1221 is CRITICAL 57.14% of data above the critical threshold [80.0] [17:38:34] yeah, but the site is up [17:38:36] PROBLEM - HHVM queue size on mw1200 is CRITICAL 75.00% of data above the critical threshold [80.0] [17:38:39] <_joe_> yes [17:38:45] PROBLEM - HHVM queue size on mw1208 is CRITICAL 85.71% of data above the critical threshold [80.0] [17:38:45] PROBLEM - HHVM busy threads on mw1224 is CRITICAL 85.71% of data above the critical threshold [115.2] [17:38:46] PROBLEM - HHVM queue size on mw1191 is CRITICAL 50.00% of data above the critical threshold [80.0] [17:38:46] PROBLEM - HHVM queue size on mw1227 is CRITICAL 50.00% of data above the critical threshold [80.0] [17:38:48] <_joe_> it's just icinga being slow at finding out [17:38:56] PROBLEM - HHVM queue size on mw1190 is CRITICAL 71.43% of data above the critical threshold [80.0] [17:38:56] PROBLEM - HHVM queue size on mw1231 is CRITICAL 62.50% of data above the critical threshold [80.0] [17:38:59] well that apparently is a spof [17:39:00] <_joe_> http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1432806086.751&target=reqstats.5xx&from=-15minutes [17:39:03] <_joe_> we're up [17:39:07] <_joe_> cmjohnson1: it is in a sense [17:39:25] RECOVERY - HHVM busy threads on mw1241 is OK Less than 30.00% above the threshold [76.8] [17:39:26] <_joe_> we can't afford a timeout of 0.5 seconds on polling a poolcounter apparently [17:39:34] <_joe_> if it's down [17:39:44] so, that's the same problem essentially we had with redis and logstash [17:39:49] the timeout being too high [17:39:52] yeah i was about to say [17:39:55] <_joe_> akosiaris: yes [17:40:01] good to know cuz we're gonna have to reboot it again at a later date to attach 
the disk shelf [17:40:04] <_joe_> that's my hypothesis [17:40:10] <_joe_> I should look at the poolcounter code [17:40:15] cmjohnson1: you can reboot it [17:40:16] we're not using it atm [17:40:26] RECOVERY - HHVM queue size on mw1235 is OK Less than 30.00% above the threshold [10.0] [17:40:29] i switched everything over to mw1154 [17:40:33] ori okay give me 10 mins [17:40:35] haven't reverted that yet [17:40:49] <_joe_> we could just put potassium first maybe [17:40:50] RECOVERY - HHVM queue size on mw1128 is OK Less than 30.00% above the threshold [10.0] [17:40:58] RECOVERY - HHVM queue size on mw1137 is OK Less than 30.00% above the threshold [10.0] [17:41:00] (03PS1) 10RobH: oliver's new ssh key [puppet] - 10https://gerrit.wikimedia.org/r/223841 [17:41:02] <_joe_> what baffles me is [17:41:08] RECOVERY - HHVM queue size on mw1126 is OK Less than 30.00% above the threshold [10.0] [17:41:08] RECOVERY - HHVM queue size on mw1125 is OK Less than 30.00% above the threshold [10.0] [17:41:09] RECOVERY - HHVM queue size on mw1120 is OK Less than 30.00% above the threshold [10.0] [17:41:13] <_joe_> why did the requests hang? [17:41:37] RECOVERY - HHVM queue size on mw1227 is OK Less than 30.00% above the threshold [10.0] [17:41:39] <_joe_> I would've expected it to find the host was down and use some logic [17:41:42] (03CR) 10RobH: [C: 032] oliver's new ssh key [puppet] - 10https://gerrit.wikimedia.org/r/223841 (owner: 10RobH) [17:41:47] RECOVERY - HHVM queue size on mw1229 is OK Less than 30.00% above the threshold [10.0] [17:41:48] RECOVERY - HHVM queue size on mw1142 is OK Less than 30.00% above the threshold [10.0] [17:41:49] RECOVERY - HHVM queue size on mw1123 is OK Less than 30.00% above the threshold [10.0] [17:41:52] to fallback to potassium ?
[17:41:55] <_joe_> after a certain number of timeouts [17:41:56] <_joe_> yes [17:41:57] RECOVERY - HHVM queue size on mw1201 is OK Less than 30.00% above the threshold [10.0] [17:41:58] RECOVERY - HHVM queue size on mw1197 is OK Less than 30.00% above the threshold [10.0] [17:42:05] <_joe_> but it's php, every request is on its own [17:42:07] RECOVERY - HHVM queue size on mw1230 is OK Less than 30.00% above the threshold [10.0] [17:42:08] RECOVERY - HHVM queue size on mw1222 is OK Less than 30.00% above the threshold [10.0] [17:42:09] RECOVERY - HHVM queue size on mw1121 is OK Less than 30.00% above the threshold [10.0] [17:42:09] RECOVERY - HHVM queue size on mw1147 is OK Less than 30.00% above the threshold [10.0] [17:42:09] yeah [17:42:09] RECOVERY - HHVM queue size on mw1200 is OK Less than 30.00% above the threshold [10.0] [17:42:09] RECOVERY - HHVM queue size on mw1221 is OK Less than 30.00% above the threshold [10.0] [17:42:09] RECOVERY - HHVM queue size on mw1231 is OK Less than 30.00% above the threshold [10.0] [17:42:09] RECOVERY - HHVM queue size on mw1115 is OK Less than 30.00% above the threshold [10.0] [17:42:12] we actually have that for redis [17:42:14] <_joe_> so you need some shared memory to do that [17:42:18] RECOVERY - HHVM queue size on mw1139 is OK Less than 30.00% above the threshold [10.0] [17:42:19] RECOVERY - HHVM busy threads on mw1201 is OK Less than 30.00% above the threshold [76.8] [17:42:19] RECOVERY - HHVM queue size on mw1223 is OK Less than 30.00% above the threshold [10.0] [17:42:19] <_joe_> it's unpleasant [17:42:23] you can do apc easily for small amounts of shared info like that too [17:42:27] <_joe_> it can be done, as we do for redis [17:42:33] <_joe_> ebernhardson: that was my thought too [17:42:37] RECOVERY - HHVM busy threads on mw1124 is OK Less than 30.00% above the threshold [57.6] [17:42:37] RECOVERY - HHVM queue size on mw1146 is OK Less than 30.00% above the threshold [10.0] [17:42:38] RECOVERY - HHVM busy threads on mw1120 is OK Less than 30.00% above the threshold [57.6] [17:42:38] RECOVERY - HHVM queue size on mw1202 is OK Less than 30.00% above the threshold [10.0] [17:42:38] RECOVERY - HHVM queue size on mw1198 is OK Less than 30.00% above the threshold [10.0] [17:42:47] RECOVERY - HHVM queue size on mw1144 is OK Less than 30.00% above the threshold [10.0] [17:42:47] RECOVERY - HHVM queue size on mw1232 is OK Less than 30.00% above the threshold [10.0] [17:42:48] RECOVERY - HHVM queue size on mw1138 is OK Less than 30.00% above the threshold [10.0] [17:42:48] so, I'm in LA on a semi-vacation (meaning: vacation if I could pry myself off of IRC) [17:42:53] can I be a jerk and not help with a postmortem? 
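(A rough illustration of the shared-memory idea _joe_ and ebernhardson discuss above: per-request PHP cannot remember on its own that a poolcounter host is down, but a small APC/APCu flag can, along the lines of what is already done for redis. Everything below is hypothetical; the function names, keys and thresholds are invented for the example and this is not the actual MediaWiki/WMF code.)

```php
<?php
// Hypothetical sketch of an APCu-backed "host down" marker. Names, keys and
// thresholds are invented for illustration only.

function pickPoolCounterServer( array $servers ) {
	foreach ( $servers as $host ) {
		// Skip hosts that earlier requests have already flagged as dead.
		if ( apcu_fetch( "pc-down:$host" ) === false ) {
			return $host;
		}
	}
	// Everything is flagged: fall back to the first host rather than failing hard.
	return $servers[0];
}

function notePoolCounterFailure( $host, $threshold = 3, $downTtl = 30 ) {
	$key = "pc-fail:$host";
	// Keep a failure counter in shared memory so it survives across requests.
	apcu_add( $key, 0, $downTtl );
	$failures = apcu_inc( $key );
	if ( $failures !== false && $failures >= $threshold ) {
		// After a few consecutive timeouts, mark the host down for a short TTL
		// so other requests skip it instead of each burning the connect timeout.
		apcu_store( "pc-down:$host", true, $downTtl );
		apcu_delete( $key );
	}
}
```

With something along these lines, a single unreachable poolcounter daemon costs a handful of timed-out connects per TTL window rather than 0.5 seconds on every page view.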
[17:42:57] RECOVERY - HHVM queue size on mw1234 is OK Less than 30.00% above the threshold [10.0] [17:42:57] RECOVERY - HHVM queue size on mw1129 is OK Less than 30.00% above the threshold [10.0] [17:42:57] RECOVERY - HHVM busy threads on mw1144 is OK Less than 30.00% above the threshold [57.6] [17:42:57] RECOVERY - HHVM busy threads on mw1229 is OK Less than 30.00% above the threshold [76.8] [17:42:58] RECOVERY - HHVM queue size on mw1124 is OK Less than 30.00% above the threshold [10.0] [17:42:58] RECOVERY - HHVM busy threads on mw1145 is OK Less than 30.00% above the threshold [57.6] [17:42:58] RECOVERY - HHVM queue size on mw1203 is OK Less than 30.00% above the threshold [10.0] [17:43:01] ori: yeah sure [17:43:02] it looks like we know what happened and how to avoid it [17:43:04] thanks [17:43:08] RECOVERY - HHVM busy threads on mw1228 is OK Less than 30.00% above the threshold [76.8] [17:43:12] <_joe_> ori: go away! [17:43:15] <_joe_> :) [17:43:18] RECOVERY - HHVM busy threads on mw1126 is OK Less than 30.00% above the threshold [57.6] [17:43:18] RECOVERY - HHVM queue size on mw1140 is OK Less than 30.00% above the threshold [10.0] [17:43:18] RECOVERY - HHVM busy threads on mw1123 is OK Less than 30.00% above the threshold [57.6] [17:43:18] RECOVERY - HHVM queue size on mw1127 is OK Less than 30.00% above the threshold [10.0] [17:43:18] RECOVERY - HHVM busy threads on mw1137 is OK Less than 30.00% above the threshold [57.6] [17:43:18] RECOVERY - HHVM queue size on mw1145 is OK Less than 30.00% above the threshold [10.0] [17:43:18] RECOVERY - HHVM busy threads on mw1223 is OK Less than 30.00% above the threshold [76.8] [17:43:19] RECOVERY - HHVM busy threads on mw1230 is OK Less than 30.00% above the threshold [76.8] [17:43:19] RECOVERY - HHVM queue size on mw1226 is OK Less than 30.00% above the threshold [10.0] [17:43:20] RECOVERY - HHVM queue size on mw1194 is OK Less than 30.00% above the threshold [10.0] [17:43:22] <_joe_> enjoy your vacation [17:43:28] RECOVERY - HHVM busy threads on mw1119 is OK Less than 30.00% above the threshold [57.6] [17:43:28] RECOVERY - HHVM busy threads on mw1198 is OK Less than 30.00% above the threshold [76.8] [17:43:28] RECOVERY - HHVM queue size on mw1191 is OK Less than 30.00% above the threshold [10.0] [17:43:37] RECOVERY - HHVM busy threads on mw1129 is OK Less than 30.00% above the threshold [57.6] [17:43:37] RECOVERY - HHVM busy threads on mw1127 is OK Less than 30.00% above the threshold [57.6] [17:43:37] RECOVERY - HHVM busy threads on mw1132 is OK Less than 30.00% above the threshold [57.6] [17:43:37] RECOVERY - HHVM busy threads on mw1222 is OK Less than 30.00% above the threshold [76.8] [17:43:37] RECOVERY - HHVM queue size on mw1190 is OK Less than 30.00% above the threshold [10.0] [17:43:38] RECOVERY - HHVM busy threads on mw1251 is OK Less than 30.00% above the threshold [76.8] [17:43:38] RECOVERY - HHVM busy threads on mw1140 is OK Less than 30.00% above the threshold [57.6] [17:43:39] RECOVERY - HHVM busy threads on mw1143 is OK Less than 30.00% above the threshold [57.6] [17:43:39] RECOVERY - HHVM queue size on mw1228 is OK Less than 30.00% above the threshold [10.0] [17:43:40] RECOVERY - HHVM busy threads on mw1131 is OK Less than 30.00% above the threshold [57.6] [17:43:40] RECOVERY - HHVM busy threads on mw1234 is OK Less than 30.00% above the threshold [76.8] [17:43:41] RECOVERY - HHVM busy threads on mw1197 is OK Less than 30.00% above the threshold [76.8] [17:43:45] but if the appservers fail, how will he use wikivoyage to find 
the places he wants to visit? :P [17:43:47] RECOVERY - HHVM queue size on mw1132 is OK Less than 30.00% above the threshold [10.0] [17:43:47] RECOVERY - HHVM busy threads on mw1147 is OK Less than 30.00% above the threshold [57.6] [17:43:48] RECOVERY - HHVM queue size on mw1131 is OK Less than 30.00% above the threshold [10.0] [17:43:48] RECOVERY - HHVM busy threads on mw1128 is OK Less than 30.00% above the threshold [57.6] [17:43:48] RECOVERY - HHVM busy threads on mw1121 is OK Less than 30.00% above the threshold [57.6] [17:43:57] <_joe_> ahah [17:43:57] RECOVERY - HHVM busy threads on mw1193 is OK Less than 30.00% above the threshold [76.8] [17:43:58] RECOVERY - HHVM busy threads on mw1191 is OK Less than 30.00% above the threshold [76.8] [17:43:59] RECOVERY - HHVM busy threads on mw1142 is OK Less than 30.00% above the threshold [57.6] [17:43:59] RECOVERY - HHVM busy threads on mw1114 is OK Less than 30.00% above the threshold [57.6] [17:43:59] RECOVERY - HHVM queue size on mw1143 is OK Less than 30.00% above the threshold [10.0] [17:43:59] RECOVERY - HHVM busy threads on mw1224 is OK Less than 30.00% above the threshold [76.8] [17:43:59] RECOVERY - HHVM queue size on mw1225 is OK Less than 30.00% above the threshold [10.0] [17:43:59] RECOVERY - HHVM busy threads on mw1221 is OK Less than 30.00% above the threshold [76.8] [17:44:00] RECOVERY - HHVM queue size on mw1122 is OK Less than 30.00% above the threshold [10.0] [17:44:07] RECOVERY - HHVM queue size on mw1130 is OK Less than 30.00% above the threshold [10.0] [17:44:09] RECOVERY - HHVM queue size on mw1119 is OK Less than 30.00% above the threshold [10.0] [17:44:09] <_joe_> ok, I'm off for now [17:44:17] RECOVERY - HHVM busy threads on mw1146 is OK Less than 30.00% above the threshold [57.6] [17:44:17] RECOVERY - HHVM busy threads on mw1235 is OK Less than 30.00% above the threshold [76.8] [17:44:17] RECOVERY - HHVM busy threads on mw1202 is OK Less than 30.00% above the threshold [76.8] [17:44:27] RECOVERY - HHVM queue size on mw1117 is OK Less than 30.00% above the threshold [10.0] [17:44:28] RECOVERY - HHVM queue size on mw1135 is OK Less than 30.00% above the threshold [10.0] [17:44:28] RECOVERY - HHVM queue size on mw1189 is OK Less than 30.00% above the threshold [10.0] [17:44:28] RECOVERY - HHVM busy threads on mw1233 is OK Less than 30.00% above the threshold [76.8] [17:44:28] RECOVERY - HHVM busy threads on mw1203 is OK Less than 30.00% above the threshold [76.8] [17:44:28] RECOVERY - HHVM busy threads on mw1122 is OK Less than 30.00% above the threshold [57.6] [17:44:29] RECOVERY - HHVM busy threads on mw1194 is OK Less than 30.00% above the threshold [76.8] [17:44:29] RECOVERY - HHVM busy threads on mw1115 is OK Less than 30.00% above the threshold [57.6] [17:44:29] RECOVERY - HHVM busy threads on mw1232 is OK Less than 30.00% above the threshold [76.8] [17:44:38] RECOVERY - HHVM busy threads on mw1125 is OK Less than 30.00% above the threshold [57.6] [17:44:38] RECOVERY - HHVM busy threads on mw1136 is OK Less than 30.00% above the threshold [57.6] [17:44:38] RECOVERY - HHVM busy threads on mw1134 is OK Less than 30.00% above the threshold [57.6] [17:44:38] RECOVERY - HHVM busy threads on mw1231 is OK Less than 30.00% above the threshold [76.8] [17:44:38] RECOVERY - HHVM queue size on mw1114 is OK Less than 30.00% above the threshold [10.0] [17:44:38] RECOVERY - HHVM queue size on mw1193 is OK Less than 30.00% above the threshold [10.0] [17:44:38] RECOVERY - HHVM busy threads on mw1199 is OK Less than 30.00% above the 
threshold [76.8] [17:44:39] RECOVERY - HHVM queue size on mw1233 is OK Less than 30.00% above the threshold [10.0] [17:44:39] RECOVERY - HHVM busy threads on mw1148 is OK Less than 30.00% above the threshold [57.6] [17:44:40] RECOVERY - HHVM queue size on mw1195 is OK Less than 30.00% above the threshold [10.0] [17:44:48] RECOVERY - HHVM queue size on mw1148 is OK Less than 30.00% above the threshold [10.0] [17:44:48] RECOVERY - HHVM queue size on mw1208 is OK Less than 30.00% above the threshold [10.0] [17:44:48] RECOVERY - HHVM busy threads on mw1190 is OK Less than 30.00% above the threshold [76.8] [17:44:48] RECOVERY - HHVM busy threads on mw1227 is OK Less than 30.00% above the threshold [76.8] [17:44:49] RECOVERY - HHVM busy threads on mw1138 is OK Less than 30.00% above the threshold [57.6] [17:44:49] RECOVERY - HHVM queue size on mw1199 is OK Less than 30.00% above the threshold [10.0] [17:44:57] RECOVERY - HHVM queue size on mw1134 is OK Less than 30.00% above the threshold [10.0] [17:44:57] RECOVERY - HHVM busy threads on mw1116 is OK Less than 30.00% above the threshold [57.6] [17:44:57] RECOVERY - HHVM queue size on mw1192 is OK Less than 30.00% above the threshold [10.0] [17:44:58] RECOVERY - HHVM busy threads on mw1200 is OK Less than 30.00% above the threshold [76.8] [17:45:08] RECOVERY - HHVM queue size on mw1116 is OK Less than 30.00% above the threshold [10.0] [17:45:08] RECOVERY - HHVM busy threads on mw1130 is OK Less than 30.00% above the threshold [57.6] [17:45:19] RECOVERY - HHVM busy threads on mw1195 is OK Less than 30.00% above the threshold [76.8] [17:45:28] RECOVERY - HHVM busy threads on mw1117 is OK Less than 30.00% above the threshold [57.6] [17:45:38] RECOVERY - HHVM busy threads on mw1225 is OK Less than 30.00% above the threshold [76.8] [17:45:48] RECOVERY - HHVM busy threads on mw1226 is OK Less than 30.00% above the threshold [76.8] [17:45:58] RECOVERY - HHVM busy threads on mw1135 is OK Less than 30.00% above the threshold [57.6] [17:45:58] RECOVERY - HHVM busy threads on mw1139 is OK Less than 30.00% above the threshold [57.6] [17:45:58] RECOVERY - HHVM busy threads on mw1189 is OK Less than 30.00% above the threshold [76.8] [17:45:58] RECOVERY - HHVM queue size on mw1136 is OK Less than 30.00% above the threshold [10.0] [17:46:08] RECOVERY - HHVM busy threads on mw1208 is OK Less than 30.00% above the threshold [76.8] [17:46:37] RECOVERY - HHVM busy threads on mw1192 is OK Less than 30.00% above the threshold [76.8] [17:46:37] RECOVERY - HHVM queue size on mw1133 is OK Less than 30.00% above the threshold [10.0] [17:46:38] bblack: wikitravel.org cough .. shhh... 
[17:46:39] RECOVERY - HHVM busy threads on mw1133 is OK Less than 30.00% above the threshold [57.6] [17:48:20] 6operations, 10Traffic, 7Monitoring: Implement pybal pool state monitoring and alerting via icinga - https://phabricator.wikimedia.org/T102394#1442282 (10Gage) [17:49:19] 6operations, 7Monitoring: icinga log rotation wipes out portions of history - https://phabricator.wikimedia.org/T102397#1442284 (10Gage) [17:50:41] (03CR) 10RobH: "I'm also not certain if the removal from here will result in puppet pulling the files off the client machines (as I would expect) or if I " [puppet] - 10https://gerrit.wikimedia.org/r/223816 (owner: 10RobH) [17:53:48] PROBLEM - Restbase root url on restbase1005 is CRITICAL - Socket timeout after 10 seconds [17:53:58] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [17:54:13] !log bounced restbase on restbase1005 [17:54:19] Logged the message, Master [17:55:29] RECOVERY - Restbase root url on restbase1005 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.006 second response time [17:55:39] chasemp: hi there, jmxtrans is a submodule, right ? [17:55:47] PROBLEM - Host helium is DOWN: PING CRITICAL - Packet loss = 100% [17:56:31] (03PS1) 10Odder: Add www.workwithsounds.eu to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223843 (https://phabricator.wikimedia.org/T105143) [17:56:58] PROBLEM - HHVM rendering on mw2033 is CRITICAL - Socket timeout after 10 seconds [17:56:59] PROBLEM - HHVM rendering on mw2111 is CRITICAL - Socket timeout after 10 seconds [17:56:59] PROBLEM - HHVM rendering on mw2074 is CRITICAL - Socket timeout after 10 seconds [17:56:59] PROBLEM - HHVM rendering on mw2104 is CRITICAL - Socket timeout after 10 seconds [17:56:59] PROBLEM - HHVM rendering on mw2065 is CRITICAL - Socket timeout after 10 seconds [17:56:59] PROBLEM - HHVM rendering on mw2078 is CRITICAL - Socket timeout after 10 seconds [17:56:59] PROBLEM - HHVM rendering on mw2155 is CRITICAL - Socket timeout after 10 seconds [17:57:07] PROBLEM - HHVM rendering on mw2186 is CRITICAL - Socket timeout after 10 seconds [17:57:08] PROBLEM - HHVM rendering on mw2212 is CRITICAL - Socket timeout after 10 seconds [17:57:08] PROBLEM - HHVM rendering on mw2041 is CRITICAL - Socket timeout after 10 seconds [17:57:08] PROBLEM - HHVM rendering on mw2014 is CRITICAL - Socket timeout after 10 seconds [17:57:08] PROBLEM - HHVM rendering on mw2045 is CRITICAL - Socket timeout after 10 seconds [17:57:08] PROBLEM - HHVM rendering on mw2115 is CRITICAL - Socket timeout after 10 seconds [17:57:08] PROBLEM - HHVM rendering on mw2092 is CRITICAL - Socket timeout after 10 seconds [17:57:09] PROBLEM - HHVM rendering on mw2098 is CRITICAL - Socket timeout after 10 seconds [17:57:09] PROBLEM - HHVM rendering on mw2060 is CRITICAL - Socket timeout after 10 seconds [17:57:10] PROBLEM - HHVM rendering on mw2189 is CRITICAL - Socket timeout after 10 seconds [17:57:10] PROBLEM - HHVM rendering on mw2106 is CRITICAL - Socket timeout after 10 seconds [17:57:11] PROBLEM - HHVM rendering on mw2158 is CRITICAL - Socket timeout after 10 seconds [17:57:11] PROBLEM - HHVM rendering on mw2125 is CRITICAL - Socket timeout after 10 seconds [17:58:48] RECOVERY - HHVM rendering on mw2033 is OK: HTTP OK: HTTP/1.1 200 OK - 65472 bytes in 0.302 second response time [17:58:48] RECOVERY - HHVM rendering on mw2074 is OK: HTTP OK: HTTP/1.1 200 OK - 65472 bytes in 0.300 second response time [17:58:48] RECOVERY - HHVM rendering on mw2111 is OK: HTTP OK: 
HTTP/1.1 200 OK - 65472 bytes in 0.296 second response time [17:58:48] RECOVERY - HHVM rendering on mw2104 is OK: HTTP OK: HTTP/1.1 200 OK - 65472 bytes in 0.306 second response time [17:58:48] RECOVERY - HHVM rendering on mw2065 is OK: HTTP OK: HTTP/1.1 200 OK - 65472 bytes in 0.301 second response time [17:58:48] RECOVERY - HHVM rendering on mw2078 is OK: HTTP OK: HTTP/1.1 200 OK - 65472 bytes in 0.302 second response time [17:58:49] RECOVERY - HHVM rendering on mw2155 is OK: HTTP OK: HTTP/1.1 200 OK - 65471 bytes in 0.284 second response time [17:58:49] RECOVERY - HHVM rendering on mw2041 is OK: HTTP OK: HTTP/1.1 200 OK - 65472 bytes in 0.372 second response time [17:58:50] RECOVERY - HHVM rendering on mw2212 is OK: HTTP OK: HTTP/1.1 200 OK - 65472 bytes in 0.542 second response time [17:58:50] RECOVERY - HHVM rendering on mw2186 is OK: HTTP OK: HTTP/1.1 200 OK - 65472 bytes in 0.772 second response time [17:58:57] RECOVERY - HHVM rendering on mw2014 is OK: HTTP OK: HTTP/1.1 200 OK - 65499 bytes in 0.302 second response time [17:58:57] RECOVERY - HHVM rendering on mw2115 is OK: HTTP OK: HTTP/1.1 200 OK - 65499 bytes in 0.305 second response time [17:58:58] RECOVERY - HHVM rendering on mw2045 is OK: HTTP OK: HTTP/1.1 200 OK - 65472 bytes in 0.721 second response time [17:58:58] RECOVERY - Host helium is UPING OK - Packet loss = 0%, RTA = 1.47 ms [17:58:58] RECOVERY - HHVM rendering on mw2098 is OK: HTTP OK: HTTP/1.1 200 OK - 65472 bytes in 0.567 second response time [17:58:58] RECOVERY - HHVM rendering on mw2092 is OK: HTTP OK: HTTP/1.1 200 OK - 65472 bytes in 0.569 second response time [17:58:58] RECOVERY - HHVM rendering on mw2189 is OK: HTTP OK: HTTP/1.1 200 OK - 65471 bytes in 0.288 second response time [17:58:59] RECOVERY - HHVM rendering on mw2060 is OK: HTTP OK: HTTP/1.1 200 OK - 65472 bytes in 0.294 second response time [17:58:59] RECOVERY - HHVM rendering on mw2106 is OK: HTTP OK: HTTP/1.1 200 OK - 65472 bytes in 0.297 second response time [17:59:00] RECOVERY - HHVM rendering on mw2125 is OK: HTTP OK: HTTP/1.1 200 OK - 65472 bytes in 0.300 second response time [17:59:00] RECOVERY - HHVM rendering on mw2158 is OK: HTTP OK: HTTP/1.1 200 OK - 65472 bytes in 0.549 second response time [17:59:38] wtf? [18:00:04] twentyafterfour greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150709T1800). Please do the needful. [18:01:23] https://wikitech.wikimedia.org/wiki/Incident_documentation/20150709-poolcounter [18:01:47] for those interested as to what all that icinga-wm noise was [18:02:38] (03PS1) 10Dzahn: ferm rules for nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970) [18:02:57] Oh I thought it was the whole site, but fortunately it wasn't :) [18:05:35] 6operations, 10ops-eqiad: db1050 raid degraded - https://phabricator.wikimedia.org/T103110#1442318 (10Cmjohnson) Jaime and Sean, db1050 ...well db1001-db1050 are all out of warranty. I've used up the last stock of 300GB SAS disks. Do you want me to order more or is it time to start planning replacement serve... [18:11:40] (03PS1) 10Matanya: Revert "make mw1154 poolcounter" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223847 [18:12:59] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [18:14:00] akosiaris: can https://phabricator.wikimedia.org/T84770 change to public view ? 
[18:15:16] 6operations, 10ops-eqiad: relocatedhelium to row A and attach new disk shelf - https://phabricator.wikimedia.org/T84770#1442334 (10akosiaris) [18:15:22] matanya: done [18:15:29] thanks [18:15:32] thx [18:15:53] are the fires all out? [18:16:07] twentyafterfour: yup [18:16:25] ok I'm gonna deploy the train and hope that doesn't start any new ones [18:16:48] 6operations: Ferm rules for backup roles - https://phabricator.wikimedia.org/T104996#1442353 (10Dzahn) a backup::host uses: 9102/tcp for bacula-fd a backup::director uses: (director role is combined with storage role, see helium) 9101/tcp for bacula-dir 9102/tcp for bacula-fd 9103/tcp for bacula-sd a backup... [18:17:19] 6operations: Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) - https://phabricator.wikimedia.org/T105378#1442355 (10Matanya) 3NEW [18:17:56] akosiaris: I hope it is ok i transform actionables into phab tickets [18:18:07] matanya: obviously [18:18:10] thanks [18:18:48] 6operations: Ferm rules for backup roles - https://phabricator.wikimedia.org/T104996#1442364 (10Dzahn) backup::host's are already covered when including base::firewall. see line 44 of role/backup.pp [18:19:24] 6operations, 6Multimedia, 6Performance-Team, 10Wikimedia-Site-requests: Please offer larger image thumbnail sizes in Special:Preferences - https://phabricator.wikimedia.org/T65440#1442365 (10Krinkle) p:5Triage>3Normal [18:20:32] (03CR) 10Dzahn: [C: 04-1] "needs https://gerrit.wikimedia.org/r/#/c/223844/ and maybe more" [puppet] - 10https://gerrit.wikimedia.org/r/223244 (https://phabricator.wikimedia.org/T104970) (owner: 10Dzahn) [18:21:47] 6operations: Revert mw1154 from being a poolcounter after helium is deemed fine again - https://phabricator.wikimedia.org/T105379#1442373 (10Matanya) 3NEW [18:23:34] 6operations: Remove poolcounter from mw1154 for housecleaning - https://phabricator.wikimedia.org/T105380#1442381 (10Matanya) 3NEW [18:23:55] 6operations: Revert mw1154 from being a poolcounter after helium is deemed fine again - https://phabricator.wikimedia.org/T105379#1442388 (10Matanya) [18:23:56] 6operations: Remove poolcounter from mw1154 for housecleaning - https://phabricator.wikimedia.org/T105380#1442381 (10Matanya) [18:24:03] (03PS1) 10Dzahn: ferm rules for bacula director [puppet] - 10https://gerrit.wikimedia.org/r/223849 (https://phabricator.wikimedia.org/T104996) [18:24:04] all done here [18:24:16] (03PS1) 1020after4: all wikis to 1.26wmf13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223850 [18:24:28] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [18:25:00] (03CR) 1020after4: [C: 032] all wikis to 1.26wmf13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223850 (owner: 1020after4) [18:25:02] (03PS2) 10Matanya: Revert "make mw1154 poolcounter" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223847 [18:25:07] (03Merged) 10jenkins-bot: all wikis to 1.26wmf13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223850 (owner: 1020after4) [18:27:24] (03PS1) 10Dzahn: ferm fules for bacula storage [puppet] - 10https://gerrit.wikimedia.org/r/223851 (https://phabricator.wikimedia.org/T104996) [18:27:45] (03PS2) 10Dzahn: ferm rules for bacula storage [puppet] - 10https://gerrit.wikimedia.org/r/223851 (https://phabricator.wikimedia.org/T104996) [18:28:02] 6operations, 5Patch-For-Review: Ferm rules for backup roles - https://phabricator.wikimedia.org/T104996#1442404 (10Dzahn) bacula-director: 
https://gerrit.wikimedia.org/r/#/c/223849/1/manifests/role/backup.pp bacula-storage: https://gerrit.wikimedia.org/r/#/c/223851/2/manifests/role/backup.pp please review [18:28:16] 6operations, 5Patch-For-Review: Ferm rules for backup roles - https://phabricator.wikimedia.org/T104996#1442405 (10Dzahn) [18:29:28] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: all wikis to 1.26wmf13 [18:29:33] Logged the message, Master [18:32:28] akosiaris: please open this one too : https://phabricator.wikimedia.org/T83729 [18:44:11] (03PS1) 10Alex Monk: TitleBlacklist: Don't block account auto-creation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223859 [18:44:15] (03Abandoned) 10Alex Monk: Block WMF account creation by users who aren't already WMF tagged [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223715 (owner: 10Alex Monk) [18:47:22] (03CR) 10Jforrester: [C: 031] TitleBlacklist: Don't block account auto-creation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223859 (owner: 10Alex Monk) [18:48:19] (03CR) 10Legoktm: [C: 031] TitleBlacklist: Don't block account auto-creation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223859 (owner: 10Alex Monk) [18:48:51] (03CR) 10Andrew Bogott: [C: 031] ldap: Allow projects to override user's loginshells [puppet] - 10https://gerrit.wikimedia.org/r/223828 (https://phabricator.wikimedia.org/T102395) (owner: 10Yuvipanda) [18:52:03] do we have network topology information in etcd? or elsewhere? E.g. could I theoretically determine which app servers share the same server rack (or more specifically, I'd like to group servers by which network switch they are attached to) [18:54:47] 6operations, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review, 7RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1442528 (10Eevans) It would appear that with `gc_grace_seconds` set to 0, and deletes happening at consistency ONE, that some deletes ar... [18:54:51] (03PS1) 10Dzahn: ferm rules for IRCd [puppet] - 10https://gerrit.wikimedia.org/r/223886 (https://phabricator.wikimedia.org/T104943) [18:57:14] Greg-g csteipp OAuth is broken on my app [18:57:56] (03PS1) 10Dzahn: argon: add base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/223887 (https://phabricator.wikimedia.org/T104943) [18:59:38] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [19:00:07] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused [19:00:54] ragesoss: do you have any details on what's wrong? /me looks for a phab ticket [19:02:54] 6operations: Track source of packages in repro - https://phabricator.wikimedia.org/T105385#1442536 (10Andrew) 3NEW [19:02:59] jgage: ^ [19:03:07] cool :) [19:03:47] bd808 working on a ticket. [19:04:43] !log bounced cassandra on restbase1004 [19:04:46] The response after clicking 'allow' has changed in a way that my app doesn't handle. Within the last two hours. 
[19:04:48] Logged the message, Master [19:05:28] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [19:05:53] ragesoss: we just rolled 1.26wmf13 out to the wikipedias today so it's likely that the change is in that branch [19:07:38] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.001 second response time on port 9042 [19:07:59] bd808: it's a Json parse error with an unexpected # of segments. [19:08:38] ragesoss: The changes in OAuth itself are very minor so likely it's something in core -- https://github.com/wikimedia/mediawiki-extensions-OAuth/compare/wmf/1.26wmf12...wmf/1.26wmf13 [19:08:54] ragesoss: can you get a dump of the json response? [19:09:32] (03CR) 10Dzahn: [C: 031] "IRCD already listening on v6 but we don't really enable it..because of this" [dns] - 10https://gerrit.wikimedia.org/r/214506 (https://phabricator.wikimedia.org/T37540) (owner: 10Dzahn) [19:11:17] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [19:12:15] bd808 will work on it. [19:29:39] Coren: how is the automated backup going? [19:31:42] (03PS5) 1020after4: Add interwiki-labs.cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175755 (https://phabricator.wikimedia.org/T69931) (owner: 10Reedy) [19:32:57] andrewbogott: crapy-ish. I'm about to push a new version to test. [19:33:05] (03CR) 1020after4: [C: 031] Add interwiki-labs.cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175755 (https://phabricator.wikimedia.org/T69931) (owner: 10Reedy) [19:33:40] Coren: ok. Let me know when you give up and I’ll make a by-hand backup just so we have one. I could use the practice anyway. [19:33:45] um… if/when :) [19:34:02] Yeah, I was about to say that I loved your vote of optimism. :-P [19:34:14] bd808: there's a moved permanently response involved. [19:34:59] ragesoss: to https? [19:37:36] Is Cassandra broken? [19:37:51] https://pt.wikipedia.org/api/rest_v1/page/html/Usu%C3%A1rio(a)%3AKrenair_(WMF)%2Fsandbox/42820644 gave me an Error in Cassandra storage backend just now [19:41:05] bd808: maybe? [19:42:19] (03CR) 10Andrew Bogott: [C: 04-1] "One question inline about your auth uri" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/201728 (https://phabricator.wikimedia.org/T89143) (owner: 10Hashar) [19:44:13] (03CR) 10Andrew Bogott: [C: 032] nodepool: provide openstack env variables to system user [puppet] - 10https://gerrit.wikimedia.org/r/220444 (https://phabricator.wikimedia.org/T103673) (owner: 10Hashar) [19:44:35] bd808: the example OAuth app is broken too [19:45:28] ragesoss: https://tools.wmflabs.org/oauth-hello-world/ or something else? [19:45:37] That one [19:46:08] (03CR) 10Andrew Bogott: [C: 032] nodepool: element to prepare an image for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/220445 (owner: 10Hashar) [19:46:10] Krenair, hang out in #wikimedia-services .. they have been working on it. 
[19:46:51] (03CR) 10Andrew Bogott: [C: 032] nodepool: element with basic networking packages [puppet] - 10https://gerrit.wikimedia.org/r/223777 (https://phabricator.wikimedia.org/T105152) (owner: 10Hashar) [19:47:18] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [19:48:08] (03CR) 10Andrew Bogott: [C: 032] nodepool: add guest disk image utilities [puppet] - 10https://gerrit.wikimedia.org/r/223543 (owner: 10Hashar) [19:48:53] bd808: logging in gives no error visible to the user, but the identify and other features then show errors on that hello world app [19:49:24] ragesoss: looking now [19:50:11] bd808, ragesoss: sounds like https://gerrit.wikimedia.org/r/#/c/219446/ [19:50:31] it does sound very much like that [19:51:08] andrewbogott: can paste a list of packages ? [19:51:12] in https://phabricator.wikimedia.org/T105385 [19:51:57] matanya: If you want :) That bug is more of a suggestion that we create a system rather than a call for immediate documentation [19:52:45] andrewbogott: it would be easier for me to undersand the scale and see what can be done with a list [19:53:01] sure, sounds reasonable [19:53:09] !log manually fixing global merge of Yuvipanda->YuviPanda (T104686) [19:53:13] Logged the message, Master [19:55:11] legoktm: i guess you didn't get to the lost user page i poked you about [19:55:46] no :( [19:55:54] Aha! Maps complete. [19:56:47] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [19:57:32] Coren: are we enabling just /data/project or all 4? [19:57:34] For maps [19:57:51] ori: do you have a moment to help me understand the wonder that is wmf nutcracker config? Or if you didn’t write that… do you know who did? [19:58:12] YuviPanda: There's potentially a lot of things in /home they might need for now. We'll need to sit down and figure out an alternative for them later I suppose. [19:58:33] Ok [20:00:58] * Coren checks the actual project instances. [20:01:01] ragesoss: anomie is looking into the oauth problem [20:02:18] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 14 data above and 8 below the confidence bounds [20:07:19] 6operations, 6Discovery, 10Maps, 6Services, 3Discovery-Maps-Sprint: Puppetize Kartotherian for maps deployment - https://phabricator.wikimedia.org/T105074#1442760 (10Deskana) [20:07:28] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Puppetize Postgres 9.4 + Postgis 2.1 role for Maps Deployment - https://phabricator.wikimedia.org/T105070#1442761 (10Deskana) [20:08:18] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Puppetize Postgres 9.4 + Postgis 2.1 role for Maps Deployment - https://phabricator.wikimedia.org/T105070#1435790 (10Deskana) [20:09:25] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Puppetize Postgres 9.4 + Postgis 2.1 role for Maps Deployment - https://phabricator.wikimedia.org/T105070#1435790 (10Deskana) [20:09:52] mutante: is tls 1.2 supported on lucid ? 
[20:10:02] (03CR) 10Hashar: nodepool: preliminary role and config file (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/201728 (https://phabricator.wikimedia.org/T89143) (owner: 10Hashar) [20:10:23] (03PS17) 10Hashar: nodepool: preliminary role and config file [puppet] - 10https://gerrit.wikimedia.org/r/201728 (https://phabricator.wikimedia.org/T89143) [20:15:29] 6operations, 6Multimedia, 6Performance-Team, 10Wikimedia-Site-requests: Please offer larger image thumbnail sizes in Special:Preferences - https://phabricator.wikimedia.org/T65440#1442794 (10Gilles) These are the sizes currently used by Media Viewer: 320, 800, 1024, 1280, 1920, 2560, 2880 Adding any of t... [20:17:14] (03PS4) 10Hashar: nodepool: provide openstack env variables to system user [puppet] - 10https://gerrit.wikimedia.org/r/220444 (https://phabricator.wikimedia.org/T103673) [20:19:38] 6operations, 7network: Establish IPsec tunnel between codfw and eqiad pfw - https://phabricator.wikimedia.org/T89294#1442810 (10Jgreen) [20:19:49] (03CR) 10Hashar: [C: 031] "Rebased on top of https://gerrit.wikimedia.org/r/#/c/201728/17 which uses OpenStack auth_uri from our general config instead of auth_url" [puppet] - 10https://gerrit.wikimedia.org/r/220444 (https://phabricator.wikimedia.org/T103673) (owner: 10Hashar) [20:47:58] !log temporarily disabled puppet on cassandra nodes while tweaking settings [20:48:02] Logged the message, Master [20:48:35] (03PS1) 10GWicke: Reduce read concurrency back to 32 [puppet] - 10https://gerrit.wikimedia.org/r/223957 [20:49:57] PROBLEM - puppet last run on cp3016 is CRITICAL puppet fail [20:50:02] (03CR) 10Eevans: [C: 031] Reduce read concurrency back to 32 [puppet] - 10https://gerrit.wikimedia.org/r/223957 (owner: 10GWicke) [20:50:10] (03PS2) 10Dduvall: contint: Install chromedriver for running MW-Selenium tests [puppet] - 10https://gerrit.wikimedia.org/r/223691 (https://phabricator.wikimedia.org/T103039) [20:50:53] hashar: ^ [20:51:17] (03PS18) 10Andrew Bogott: nodepool: preliminary role and config file [puppet] - 10https://gerrit.wikimedia.org/r/201728 (https://phabricator.wikimedia.org/T89143) (owner: 10Hashar) [20:52:18] i've noticed a lot of conditionals in ops/puppet for operatingsystem and os_version(...). would it make sense to add these to the hiera hierarchy? [20:52:31] (03CR) 10Andrew Bogott: [C: 032] nodepool: preliminary role and config file [puppet] - 10https://gerrit.wikimedia.org/r/201728 (https://phabricator.wikimedia.org/T89143) (owner: 10Hashar) [20:53:16] maybe ::lsbdistid and ::lsbdistcodename? 
[20:53:39] (03PS3) 10Andrew Bogott: nodepool: add guest disk image utilities [puppet] - 10https://gerrit.wikimedia.org/r/223543 (owner: 10Hashar) [20:54:20] (03PS5) 10Andrew Bogott: nodepool: provide openstack env variables to system user [puppet] - 10https://gerrit.wikimedia.org/r/220444 (https://phabricator.wikimedia.org/T103673) (owner: 10Hashar) [20:55:23] (03CR) 10Andrew Bogott: [C: 032] nodepool: add guest disk image utilities [puppet] - 10https://gerrit.wikimedia.org/r/223543 (owner: 10Hashar) [20:55:41] (03CR) 10Andrew Bogott: [C: 032] nodepool: provide openstack env variables to system user [puppet] - 10https://gerrit.wikimedia.org/r/220444 (https://phabricator.wikimedia.org/T103673) (owner: 10Hashar) [20:55:50] (03PS6) 10Andrew Bogott: nodepool: provide openstack env variables to system user [puppet] - 10https://gerrit.wikimedia.org/r/220444 (https://phabricator.wikimedia.org/T103673) (owner: 10Hashar) [20:55:59] wtf [20:56:03] anyone doing merge user stuff [20:56:04] legoktm: ? [20:56:11] hoo|busy: yes [20:56:20] why? [20:56:31] db1049 is suffering [20:56:32] badly [20:56:51] (03CR) 10Andrew Bogott: [C: 032] nodepool: provide openstack env variables to system user [puppet] - 10https://gerrit.wikimedia.org/r/220444 (https://phabricator.wikimedia.org/T103673) (owner: 10Hashar) [20:56:51] s5 might be go nuts about this [20:57:06] I just finished the dewiki merge of YuviPanda [20:57:07] marxarelli: I'm not sure what you mean with 'hiera hierarchy', but facter provides $::lsbdistid and $::lsbdistcodename? [20:57:12] which is s5 yeah [20:57:18] yikes [20:57:32] (03PS4) 10Andrew Bogott: nodepool: element to prepare an image for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/220445 (owner: 10Hashar) [20:57:41] andrewbogott_afk: \O/ I am wondering what will happen on labnodepool [20:57:41] I bet it's the unindexed flagged revs update [20:58:07] it is [20:58:39] (03CR) 10Andrew Bogott: [C: 032] nodepool: element to prepare an image for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/220445 (owner: 10Hashar) [20:58:58] (03PS2) 10Andrew Bogott: nodepool: element with basic networking packages [puppet] - 10https://gerrit.wikimedia.org/r/223777 (https://phabricator.wikimedia.org/T105152) (owner: 10Hashar) [20:59:05] crap [20:59:44] that table is freaking huge [20:59:58] and db1049 is a rather old host with spinning disks [21:00:51] twentyafterfour: greg-g: is the MW train over? [21:01:03] legoktm: Any idea how long that will take? [21:01:08] Worth depooling? [21:01:12] I need an emergency deploy window for https://gerrit.wikimedia.org/r/#/c/223952/ [21:01:30] hoo: the master query already finished [21:01:45] Master is different [21:01:48] but yeah, ok [21:01:56] that means it's probably not take forevert [21:02:04] oh, it's until 13 PDT, not 14 [21:02:07] https://tendril.wikimedia.org/host/view/db1049.eqiad.wmnet/3306 says 13m [21:02:09] so it probably is then [21:02:22] tgr: twentyafterfour has been idle on tin for 2.5h so I think you're clear [21:02:53] the cool new thing (now that I know about it) is that gerrit will make the submodule bump for you as soon as you merge the backport [21:03:53] bd808: yeah, that just caused me 20 mins of head scratching in the last SWAT [21:04:11] (03CR) 10Hashar: nodepool: preliminary role and config file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/201728 (https://phabricator.wikimedia.org/T89143) (owner: 10Hashar) [21:04:13] hoo: it's still going up...15m now. if you think we should depool it, lets do that. [21:04:18] heh. 
It made me yell at many.bubbles yesterday [21:04:32] I got all these empty commit errors and thought my submodules are screwed up somehow [21:04:58] it is a great feature but should have been announced more widely [21:05:48] !log backporting https://gerrit.wikimedia.org/r/#/c/223952/- fixes OAuth which is broken for 1.26wmf13 [21:05:52] Logged the message, Master [21:06:06] (03PS1) 10Hashar: nodepool: actually set $novaconfig [puppet] - 10https://gerrit.wikimedia.org/r/223960 [21:06:17] PROBLEM - puppet last run on labnodepool1001 is CRITICAL puppet fail [21:06:43] (03CR) 10Hashar: "I forgot to set $novaconfig in the role scope :D https://gerrit.wikimedia.org/r/#/c/223960/1/manifests/role/nodepool.pp,unified" [puppet] - 10https://gerrit.wikimedia.org/r/201728 (https://phabricator.wikimedia.org/T89143) (owner: 10Hashar) [21:06:54] legoktm: It's done [21:07:00] was about to depool :D [21:07:00] andrewbogott: I made a trivial puppet mistake :-/ [21:07:07] RECOVERY - puppet last run on cp3016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:07:10] ok [21:10:53] tgr: the train only takes about 5 minutes these days :) (assuming nothing goes wrong) [21:12:57] (03CR) 10Andrew Bogott: [C: 032] nodepool: actually set $novaconfig [puppet] - 10https://gerrit.wikimedia.org/r/223960 (owner: 10Hashar) [21:13:19] (and excluding Tuesday's new branch which takes much more than 5 minutes) [21:15:48] PROBLEM - Incoming network saturation on labstore1001 is CRITICAL 10.71% of data above the critical threshold [100000000.0] [21:20:34] valhallasw`cloud: oh, i meant add %{::lsbdistid} and %{::lsbdistcodename} to hiera.yaml [21:21:01] so that hiera resolves data bindings based on the OS and version from yaml files [21:21:24] Ah! That's an interesting idea [21:22:16] or maybe '%{::lsbdistid}' and '%{::lsbdistid}/%{::lsbdistcodename}' [21:23:10] tgr: need any help? [21:23:29] I'm getting there :) [21:23:55] the deploy docs on extension security patches seem entirely outdated [21:24:09] andrewbogott: maybe you forgot to merge on palladium ? labnodepool still complains [21:24:10] looks like git is configured to do a rebase on submodule update now [21:24:18] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [21:24:39] hashar: nope, merged… [21:24:45] did I miss a patch? [21:25:46] tgr: seems like there have been lots of little nice changes that haven't been well documented/publicized yet [21:28:35] andrewbogott: https://gerrit.wikimedia.org/r/#/c/223960/1/manifests/role/nodepool.pp,unified that one maybe [21:28:52] andrewbogott: but if it is merged on palladium, it did not get rid of the puppet error :/ [21:29:44] andrewbogott: no matter, that is a different one now [21:30:04] !log tgr Synchronized php-1.26wmf13/extensions/OAuth/api/MWOAuthAPI.setup.php: no canonical redirects for requests with OAuth headers (duration: 00m 12s) [21:30:09] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [21:30:09] Logged the message, Master [21:31:14] andrewbogott: Could not find declared class ::nodepool at /etc/puppet/manifests/role/nodepool.pp:25 [21:31:32] andrewbogott: because class role::nodepool invokes the class { '::nodepool': } [21:31:47] andrewbogott: seems it refers to self (role:nodepool) instead of the module class grblblblbl [21:32:51] andrewbogott: unmerged on strontrium [21:33:06] bblack: how did that happen? 
[21:33:56] the puppet-merge script fails like that sometimes, it's not uncommon if you move from C2/merge -> puppet-merge in another window very quickly, as in before the first even finishes loading in the browser [21:34:44] basically gerrit is a little slow, and not perfectly in sync either. I think it has sort of an eventually-consistent view of repo state between java threads for different clients :) [21:35:00] hm, so I see. hashar, try now? [21:35:37] Could not find declared class ::nodepool at /etc/puppet/manifests/role/nodepool.pp:25 :( [21:35:48] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [21:36:04] but in the role::nodepool class, the nodepool module class is called with :: [21:36:15] so it should do a global lookup [21:36:53] just try again now or in another minute? [21:38:03] oh, the latest change was just a config variable update... the classes should have existed before all that [21:38:36] twentyafterfour: can you verify that the submodule update with security patches part of https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Trying_tin.27s_code_on_testwiki is unnecessary now? [21:38:42] should the puppetmaster be restarted? [21:39:09] not usually [21:39:45] bblack: Does Varnish cache 301s? [21:40:29] we pushed a fix for an OAuth bug caused by URL canonicalization and ragesoss says he's still seeing 301 responses that I'm not seeing [21:43:32] bblack: just to be on the safe side, could you ban anything that has Special:OAuth/identify in it? [21:44:40] andrewbogott: so I think the puppetmaster has got to be kicked. It apparently can't find the new nodepool module for some reason [21:45:15] andrewbogott: maybe because inside role::nodepool , it optimizes '::nodepool' and thinks it already knows about it, thus skips loading the actual module. Bam class not found [21:46:03] (03CR) 10BBlack: [C: 04-1] "It won't remove it like it is now, but you can do it "right" in two steps, first set is as "ensure => absent", then later remove the whole" [puppet] - 10https://gerrit.wikimedia.org/r/223816 (owner: 10RobH) [21:46:12] I can’t imagine. I feel like I’ve seen this behavior before but don’t remember what caused it. Let me google a bit. [21:53:27] (03PS1) 10Andrew Bogott: Rename nodepool role to nodepool::server role [puppet] - 10https://gerrit.wikimedia.org/r/223969 [21:54:29] (03CR) 10Andrew Bogott: [C: 032] Rename nodepool role to nodepool::server role [puppet] - 10https://gerrit.wikimedia.org/r/223969 (owner: 10Andrew Bogott) [21:55:20] hashar: well, ^ didn’t help [21:55:28] :-((((((((((((((((((( [21:55:33] so, best to set up something in labs and figure out how to make this behave :( [21:55:52] (03PS1) 10Andrew Bogott: Revert "Rename nodepool role to nodepool::server role" [puppet] - 10https://gerrit.wikimedia.org/r/223970 [21:56:38] andrewbogott: have you tried restarting the puppetmaster or does that have too much impact on prod?
[21:56:45] (03CR) 10Andrew Bogott: [C: 032] Revert "Rename nodepool role to nodepool::server role" [puppet] - 10https://gerrit.wikimedia.org/r/223970 (owner: 10Andrew Bogott) [21:58:07] hashar: tried, no dice [21:58:16] :-( [21:58:37] I will try on labs tomorrow [22:01:47] (03CR) 10Hashar: "For what it is worth, my lame attempt at reusing operations/puppet.gt with Vagrant was at : https://github.com/hashar/vagrantwmflabs No" [puppet] - 10https://gerrit.wikimedia.org/r/212294 (owner: 10BryanDavis) [22:02:20] (03PS3) 10Giuseppe Lavagetto: Revert "make mw1154 poolcounter" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223847 (owner: 10Matanya) [22:02:47] PROBLEM - puppet last run on cp3036 is CRITICAL Puppet has 1 failures [22:03:56] (03CR) 10Giuseppe Lavagetto: [C: 032] Revert "make mw1154 poolcounter" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223847 (owner: 10Matanya) [22:04:29] PROBLEM - puppet last run on db2056 is CRITICAL Puppet has 1 failures [22:04:41] andrewbogott: heading to bed. Thanks for all the merges! [22:04:55] more stuff will come though [22:05:47] PROBLEM - puppet last run on labvirt1001 is CRITICAL Puppet has 1 failures [22:06:57] PROBLEM - puppet last run on mw2070 is CRITICAL Puppet has 1 failures [22:06:58] PROBLEM - puppet last run on mw1253 is CRITICAL Puppet has 1 failures [22:07:57] PROBLEM - puppet last run on tmh1001 is CRITICAL Puppet has 1 failures [22:08:08] PROBLEM - puppet last run on mw2077 is CRITICAL Puppet has 1 failures [22:08:18] PROBLEM - puppet last run on mw1075 is CRITICAL Puppet has 1 failures [22:09:07] !log oblivian Synchronized wmf-config/PoolCounterSettings-eqiad.php: I don't think we want to keep poolcounter running on an imagescaler (duration: 00m 12s) [22:09:08] PROBLEM - puppet last run on mw2182 is CRITICAL Puppet has 1 failures [22:09:10] Logged the message, Master [22:11:43] (03PS1) 10BBlack: Remove wap and mobile subdomains [dns] - 10https://gerrit.wikimedia.org/r/223972 (https://phabricator.wikimedia.org/T104942) [22:12:11] (03PS1) 10EBernhardson: Make jq available on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/223974 [22:13:10] matanya: no, pretty sure no. needs OpenSSL >= 1.0.1 or GnuTLS [22:14:05] tgr: can you give me an example of a 'Special:OAuth/identify' URL just to make sure I don't misunderstand? [22:14:27] (what hosts/wikis is this on, what paths, ?) [22:14:29] bblack: https://phabricator.wikimedia.org/T105387#1443192 [22:14:50] on every wiki [22:15:11] thanks mutante [22:15:12] just banning URLs containing title=Special:OAuth/identify should be enough [22:15:19] wait [22:15:31] what's that last bit there about "which ignores Cache-Control: private and does not vary on the Authorization header" ? [22:15:47] yeah, we are still trying to figure out [22:15:57] looks like Varnish should not cache this, but it does [22:16:04] if varnish is going to cache some kind of auth[nz] results and share them between users globally, it might be best to sort that out before going any further... [22:16:28] RECOVERY - puppet last run on mw1253 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [22:16:32] have you looked at what other code is sending for Cache-Control headrs? generally varnish is pretty good about that stuff, you just have to know which headers are the right ones to send. [22:16:48] PROBLEM - puppet last run on mw2054 is CRITICAL Puppet has 1 failures [22:16:56] is the oauth stuff actually useful for any kind of auth[nz] yet? 
[22:17:08] RECOVERY - puppet last run on labvirt1001 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [22:17:18] RECOVERY - puppet last run on tmh1001 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [22:17:25] I have like, zero context on this project, but the idea that auth-related results are getting cached sounds Bad [22:17:26] yes. very useful and the reason we need to figure out how to fix [22:17:26] bd808 just dug up Special:OAuth/identify which looks like it would disable caching on cache-control:private [22:17:38] RECOVERY - puppet last run on mw2077 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [22:17:38] ut that does not seem to work in practice [22:17:41] bd808: I don't mean useful, I mean "is in use, affects live users right now" [22:17:48] RECOVERY - puppet last run on db2056 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:17:49] RECOVERY - puppet last run on cp3036 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [22:18:00] OAuth is in production and widely used [22:18:11] WikiEducation uses it, for example [22:18:18] RECOVERY - puppet last run on mw2070 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [22:18:20] a bunch of Commons tools use [22:18:22] So https://gerrit.wikimedia.org/r/#/c/219446/ introduced URL canonicalization [22:18:23] ...it [22:18:30] Phabricator uses it for logins [22:18:38] RECOVERY - puppet last run on mw2182 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [22:18:45] 6operations: Track source of packages in repro - https://phabricator.wikimedia.org/T105385#1443278 (10Dzahn) when built from upstream source they should be in component universe. if we made them all ourselves they should be in component main. i think they should always be in gerrit unless there is a really good... [22:19:01] and, varnish is caching dynamic per-user results with auth-y things in them and handing them out to other users? [22:19:02] and https://gerrit.wikimedia.org/r/#/c/223961/ fixes that to ignore this url [22:19:38] it appears to have cached a bare 301 telling the client to hit a different form of the URL [22:19:38] RECOVERY - puppet last run on mw1075 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:19:54] bblack: I was under the impression that we use cache-control to regulate what can and cannot be cached [22:19:59] yes, we do [22:20:12] not just we, it's a standard [22:20:12] and here the response is cached despite cache-control:private [22:20:19] ragesoss just said it's working for him now :/ [22:20:20] 6operations: Track source of packages in repro - https://phabricator.wikimedia.org/T105385#1443289 (10Dzahn) sometimes we have this mixed up for different versions of the same package, example: http://apt.wikimedia.org/wikimedia/pool/main/a/adminbot/ vs. http://apt.wikimedia.org/wikimedia/pool/universe/a/adm... [22:20:24] well, the 301 is, not sure about the actual response [22:21:02] tgr: just confirmed that http://dashboard-testing.wikiedu.org/ is working now [22:21:12] hmmmm ok [22:21:14] so maybe it fell out of cache organically? [22:22:30] bblack: I can verify that the actual response is not cached but the redirect is [22:23:14] or was... 
the way the redirect was sent it wouldn't have the cache-control:private header either [22:23:17] that's not a security issue (at least in this case definitely not) but it does break the clients [22:23:46] enwiki seems to be fixed now just by waiting [22:23:46] ok [22:25:29] if the 301's are from the canonical-redirect, it has: [22:25:29] » » » $output->setSquidMaxage( 1200 ); 310 [22:25:30] » » » $output->redirect( $targetUrl, '301' ); [22:25:49] so I would imagine varnish is caching those for 20 minutes [22:26:13] we can wait for that to expire normally [22:26:52] it's been almost an hour since the fix was pushed [22:27:01] so cache control is added by Varnish? [22:27:21] because the 301 responses have cache-control:private on them [22:27:25] that topic is complicated, depends which cache control you mean [22:27:46] generally speaking, varnish tends to do the right thing, it's a matter of understanding the rather arcane standards [22:28:05] well it's clearly not, in this case [22:28:09] and understanding that the cache-control emitted from wiki->varnish controls varnish, and varnish emits a different one for varnish->user [22:28:26] tgr: I think you'll need to show substantive proof of that with an example [22:29:15] as I just pasted in code above, it seems our canonicalization redirects (the one generating the bad 301 for ~) explicitly ask varnish to cache for 20 minutes [22:29:17] bblack: https://phabricator.wikimedia.org/P929 [22:29:33] the second request should not be redirected [22:30:45] and by "the right thing" I mean "what we need" :) [22:31:01] sure, but I'm arguing "what you need" is different from "what you asked varnish to do" :) [22:31:28] probably so [22:31:35] how can we fix it? [22:31:59] can we vary on the Authorization header? [22:31:59] a proof of varnish misbehaving in regards to cache-control would have to include requests that bypass varnish to show "this is the cache-control + other details varnish fetched, and then separately this is the public side not obeying them" [22:32:08] RECOVERY - puppet last run on mw2054 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [22:32:20] tgr: ignoring the Authorization issue for the moment.... [22:32:50] the canonicalization URLs being sent by MediaWiki seem to be asking varnish explicitly for 20-minute caching. I don't imagine we want to drop that in the general case. [22:33:00] that just means it was going to take 20 minutes to see full effect on your fix [22:33:27] varnish seems to do as told, it's just strange that it puts cache-control: private on the 301 responses while it actually caches them [22:34:05] cache-control from mw->varnish is completely separate from cache-control from varnish->browser [22:34:12] but that's not an issue, it just had me confused for a while [22:34:17] it's expected we have different policies there (one we can control, one we cannot) [22:34:18] PROBLEM - puppet last run on es2007 is CRITICAL puppet fail [22:34:51] and cache-control itself is hugely complicated. I can never remember all of how it behaves without referring to references [22:35:09] the problem is that no canonicalization should happen for requests with an "Authorization: OAuth" header, and if we cache 301s that cannot be guaranteed [22:35:25] and browser versions and protocols...
c-c is a mess in reality [22:35:33] so as far as I can see we need to vary on that header [22:36:29] well [22:36:53] 6operations: Ferm rules for netmon1001 - https://phabricator.wikimedia.org/T105410#1443333 (10Dzahn) [22:37:16] 6operations: Ferm rules for netmon1001 - https://phabricator.wikimedia.org/T105410#1443337 (10Dzahn) [22:37:30] I'm not sure that means "Vary: Authorization" is the answer. You could also say the code that does MW-level canonicalization redirects should exclude some subset of the URL space that includes your OAuth requests, or something like that... [22:37:55] also this is relevant: [22:37:56] http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.8 [22:40:08] bblack: what's the relevance? [22:40:32] that varnish would not cache if must-revalidate wasn't set? [22:41:47] bblack@strontium:~$ curl -sv 'http://en.wikipedia.org/w/index.php?title=Foo' --resolve 'en.wikipedia.org:80:10.2.2.1' 2>&1 >/dev/null |egrep '301|Locat|Cache' [22:41:50] < HTTP/1.1 301 Moved Permanently [22:41:53] < Cache-control: s-maxage=1200, must-revalidate, max-age=0 [22:41:55] < Location: http://en.wikipedia.org/wiki/Foo [22:42:13] ^ the relevance is that our canonicalization redirects specify s-maxage=1200 to varnish [22:42:40] the RFC says that in general, requests with request-header Authorization shouldn't be cached by a shared cache, but then makes an exception for s-maxage [22:44:18] so what I'm saying is, I think varnish is being RFC-correct here. The question is whether we should be issuing canonicalization responses for these types of requests in MW at all or not [22:45:49] ah, sorry, still mixing up the pre- and post-varnish cache headers :( [22:46:19] so yes, Varnish is correct here, but the site breaks :) [22:46:41] yeah I just don't know what the "right" fix is here [22:47:00] tgr: But that is what https://gerrit.wikimedia.org/r/#/c/223961 fixes right? Not redirecting when the headers are present? [22:47:18] given that OAuth is an authorization method, I'm not sure how easy it is to predict if something might be accessed by it in the future [22:47:59] bd808: yes but next time someone visits it in a browser they will still get a 301 and it will still be cached [22:48:23] it's not great if I can break all enwiki apps for an hour just by visiting Special:OAuth/identify [22:49:04] tgr: next time someone visits what in a browser? [22:49:29] bblack: Special:OAuth/identify is the case where this came up [22:50:19] OAuth uses it as a noop where there is no real action but the caller learns the identity of the user who gave the OAuth grant [22:50:24] even now, after your fix? [22:50:56] the fix checks whether you have an OAuth header and does not redirect if you do [22:51:05] that's not really caching-friendly [22:51:28] we could disable canonicalization for Special:OAuth and children completely [22:51:53] so, with the fix in place, it should *not* be the case that "I can break all enwiki apps for an hour just by visiting Special:OAuth/identify", right? [22:52:17] The fix applied avoids canonical redirects for all requests that contain OAuth headers, but if the same URL is requested without OAuth headers then it will still redirect and presumably Varnish will cache the 301 and serve it up even if the header is present on a second request [22:52:53] yeah, that might be true [22:53:01] can I test that without causing horrible carnage?
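A low-impact way to answer that last question, without touching any OAuth consumer traffic, is to watch the Age and X-Cache response headers across repeated requests: a 301 that comes back with Age > 0 (or an X-Cache hit) is being served from Varnish rather than generated by MediaWiki. The following is a sketch only, using test.wikipedia.org as a throwaway target and a dummy OAuth header.

    URL='https://test.wikipedia.org/w/index.php?title=Special:OAuth/identify'
    # 1) prime the cache with a plain request (no Authorization header)
    curl -s -o /dev/null -D - "$URL" | egrep -i '^HTTP|^Age:|^X-Cache|^Location'
    # 2) repeat with an OAuth Authorization header; a 301 with Age > 0 here
    #    means the cached redirect is being replayed to the authorized request
    curl -s -o /dev/null -D - -H 'Authorization: OAuth test' "$URL" | egrep -i '^HTTP|^Age:|^X-Cache|^Location'

This is essentially the same check as the curl tests bblack pastes below, just with the cache-indicating headers included.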
[22:53:06] see https://phabricator.wikimedia.org/P929 for a reproduction [22:53:19] RECOVERY - puppet last run on es2007 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:53:26] you can break testwiki and we won't be too sad [22:54:31] if we disable canonicalization for Special:OAuth/* regardless of headers that will *probably* fix this for all requests that will actually happen [22:55:00] in theory you can use OAuth for any request, not just Special:OAuth or the API [22:55:04] as far as I can see [22:55:15] but there is no reason you would [22:55:20] yeah that seems hacky though, just a path regex exception or whatever [22:55:48] this happens in MediaWiki so we can just check the title [22:55:59] well we could also revert the patch that caused all this and send them back tot he drawing board [22:56:00] that way it is not hacky [22:56:47] my tests on beta.wmflabs.org say it doesn't happen there. assuming beta doesn't have some general cache-busting stuff that works around it anyways [22:58:02] bblack-mba:core bblack$ curl -H 'Authorization: OAuth X' 'http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special:OAuth/identify'; echo [22:58:05] {"error":"mwoauth-oauth-exception"} [22:58:08] bblack-mba:core bblack$ curl -v 'http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special:OAuth/identify' 2>&1 | egrep '301|Locat|Cache-Control' [22:58:10] < HTTP/1.1 301 Moved Permanently [22:58:13] < Location: http://en.wikipedia.beta.wmflabs.org/wiki/Special:OAuth/identify [22:58:16] < Cache-Control: private, s-maxage=0, max-age=0, must-revalidate [22:58:18] bblack-mba:core bblack$ curl -H 'Authorization: OAuth X' 'http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special:OAuth/identify'; echo [22:58:21] {"error":"mwoauth-oauth-exception"} [22:58:29] ^ OAuth req -> 301'd req -> OAuth req not using the 301 out of cache [22:59:03] I'll try it on a small-volume real wiki [23:00:04] RoanKattouw ostriches rmoen Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150709T2300). Please do the needful. [23:00:04] Krenair: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:14] hmm same results on ca.wikipedia.org, but it also doesn't give an oauth exception either, just a blank response [23:00:21] maybe not enabled there? [23:00:23] no idea [23:00:44] oh wait, test was invalid, http->https [23:00:49] (03CR) 10Catrope: [C: 032] TitleBlacklist: Don't block account auto-creation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223859 (owner: 10Alex Monk) [23:00:52] bblack: you can test at https://test.wikipedia.org/w/index.php?title=Special:OAuth/identity [23:01:20] (03Merged) 10jenkins-bot: TitleBlacklist: Don't block account auto-creation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/223859 (owner: 10Alex Monk) [23:01:25] And it will break https://tools.wmflabs.org/oauth-hello-world/index.php?action=identify if it does cache the 301 [23:01:39] I just tested on ca.wikipedia.org, it doesn't break it there [23:02:14] oh.. hmm. 
the test app actually hits mw.o [23:02:22] !log catrope Synchronized wmf-config/CommonSettings.php: TitleBlacklist: Don't block account auto-creation (duration: 00m 13s) [23:02:24] let's not break that on purpose [23:02:26] Logged the message, Master [23:03:01] bblack-mba:core bblack$ curl -v 'https://ca.wikipedia.org/w/index.php?title=Special:OAuth/identify' 2>&1 | egrep '301|Locat|Cache-Control' [23:03:04] < HTTP/1.1 301 Moved Permanently [23:03:06] < Location: https://ca.wikipedia.org/wiki/Especial:OAuth/identify [23:03:09] < Cache-Control: private, s-maxage=0, max-age=0, must-revalidate [23:03:11] bblack-mba:core bblack$ curl -H 'Authorization: OAuth X' 'https://ca.wikipedia.org/w/index.php?title=Special:OAuth/identify'; echo [23:03:14] {"error":"mwoauth-oauth-exception"} [23:03:27] ^ the redirect doesn't become a cached result for a request with an auth header [23:05:26] but this all still smells like special-caseing, with having a hook into the oauth extension to suppress 301 if the request contains 'Authorization: OAuth ' [23:05:57] MW is all special casing [23:06:02] :) [23:06:15] hooks inside hooks with some hacks on the side [23:07:40] !log bounced cassandra on restbase1004 [23:08:16] poor cassandra. get well soon [23:09:24] I think I recently saw a tear in the corner of her beautiful eyes [23:10:42] http://cassandra.apache.org/media/img/cassandra_logo.png [23:11:36] tgr, bd808: was it ever the case, even before today's 301-fixup, that requests with "Authorization: " headers were responded to with a 301? [23:11:55] well, let me amend that statement.... [23:12:06] tgr, bd808: was it ever the case, even before today's 301-fixup, that requests with "Authorization: " headers were responded to with a cached 301 by varnish? [23:12:15] bblack: you have magical fingers, apparently [23:12:24] still happening for me: https://phabricator.wikimedia.org/P930 [23:13:36] hmmm what's different in our tests? [23:13:48] I have no idea, but I imagine it was always like that if you requested an URL which produced a redirect [23:14:06] well I'm still digging to confirm [23:14:33] but I'm pretty sure varnish has built-in default behavior at some layer or other which says "If the request contains an Authorization header, bypass caching" [23:15:31] ok wait a minute, your test is wierd in some non-obvious way I think [23:15:33] but as you pointed out that should be overridable with cache-control: public and MediaWiki does exactly that [23:16:09] what's with the %C3%A1 -> a in the redirect? [23:16:24] like, why is that redirect happening at all in mediawiki? [23:16:33] hm [23:16:41] a localization redirect? [23:16:51] I don't know [23:16:52] didn't think of that [23:17:16] so I guess the "Special:" gets localized, in general? [23:17:20] from the canonical special page name to the local special page name, probably [23:17:31] I wonder why that does not happen on ca [23:17:48] but, I thought we just rolled out a change saying no canonical redirects for requests with Auth: OAuth anyways [23:18:05] it must be some other part of varnish doing that redirect for localization? [23:18:08] errrr... [23:18:16] it must be some other part of MediaWiki doing that redirect for localization? 
[23:18:17] or it gets cached [23:18:48] I don't think that 301 is ever a cached response from varnish, when the request contains any Authorization header [23:18:58] PROBLEM - puppet last run on mc2005 is CRITICAL puppet fail [23:19:14] look at the X-Cache headers in https://phabricator.wikimedia.org/P929 [23:19:50] anyway, let me run that test on enwikivoyage to have less confounding factors [23:20:44] there was only one that was not a miss/miss, and even then X-Cache can be misleader, but we can look into it [23:20:58] 6operations, 10Wikimedia-IRC: enable IPv6 on irc.wikimedia.org - https://phabricator.wikimedia.org/T105422#1443503 (10Dzahn) 3NEW [23:21:03] but really, I think step 1 on any of these tests, is test it directly on mediawiki without varnish and see that you get expected behavior there first [23:21:34] I can reproduce caching on enwikivoyage [23:22:09] maybe what happened on your test is that you tested first with an OAuth header then without then with it again [23:22:18] hmmmm [23:22:22] maybe [23:22:30] in that case on the first request a hit_for_pass is cached [23:22:54] (03PS4) 10Dzahn: add AAAA record for argon (irc,rc streams) [dns] - 10https://gerrit.wikimedia.org/r/214506 (https://phabricator.wikimedia.org/T105422) [23:23:08] (03PS7) 10Dzahn: add IPv6 for argon (irc,mw-rc streams) [puppet] - 10https://gerrit.wikimedia.org/r/214434 (https://phabricator.wikimedia.org/T105422) [23:23:15] here are my tests: https://phabricator.wikimedia.org/P931 [23:23:17] no [23:23:36] 6operations, 10Wikimedia-IRC, 5Patch-For-Review: enable IPv6 on irc.wikimedia.org - https://phabricator.wikimedia.org/T105422#1443521 (10Dzahn) [23:23:40] the problem here is that our own VCL code breaks the varnish-default behavior of not using cache for reqs with the authorization header [23:23:46] I did test locally and it did work as expected [23:23:59] because at the bottom of text's vcl_recv, we don't fall-through to the default vcl_recv, and we don't pay attention to the header either [23:24:13] anyway if the problem exists outside my head then https://gerrit.wikimedia.org/r/#/c/223980/ should fix it :) [23:24:56] if you can fix it in Varnish, even better [23:25:24] tgr: I agree that would work around this, but at a deeper level, we shouldn't be breaking varnish's default/correct behavior of doing "return (pass)" when an Authorization header is present [23:25:37] but I need to do some digging to make sure that's not intentional for some other messed up reason [23:26:44] it's over my head whether that's the correct thing to do or not, I'll hold back the patch until you comment on the task then [23:27:04] I think after the first patch got deployed there are no urgent problems anymore [23:31:31] (03PS2) 10Dzahn: ferm rules for IRCd [puppet] - 10https://gerrit.wikimedia.org/r/223886 (https://phabricator.wikimedia.org/T104943) [23:31:35] tgr: unless someone fetches the URL without an auth header :) [23:32:18] well that's unlikely enough that we can leave it like that for a few days [23:33:54] 6operations, 10Wikimedia-IRC, 5Patch-For-Review: enable IPv6 on irc.wikimedia.org - https://phabricator.wikimedia.org/T105422#1443581 (10Krinkle) [23:34:39] so, in short, MW-level issues aside: (1) the standards allow caching a response to an Authorization request, if s-max-age, etc (2) varnish's default built-in vcl_recv includes a (pass) for all Authorization requests anyways, but (3) Our custom VCL prevents that default VCL from ever running, so we do ignore it for caching [23:35:57] RECOVERY - puppet last run on 
mc2005 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [23:36:41] I've looked at this a little bit in VCL now, and I think we can safely switch back to correct behavior there, I'll try to get a patch in later this afternoon/evening [23:38:00] tgr: also, I can test that part independently of OAuth, so if you want to push https://gerrit.wikimedia.org/r/#/c/223980 that doesn't have any effect on what I'm looking at [23:38:56] if you fix the VCL, I don't think there will be any need for the patch [23:39:45] you still don't want localization redirects from mediawiki for Special URLs right? [23:40:00] hm.. No commons images on wikitech? [23:40:09] Why not. [23:40:43] (03PS1) 10Dzahn: add admin group 'wikidata query service deployers' [puppet] - 10https://gerrit.wikimedia.org/r/223984 (https://phabricator.wikimedia.org/T105185) [23:41:03] Krinkle: I thought we had fixed instant commons on wikitech a while ago. Broken again? [23:41:25] https://wikitech.wikimedia.org/wiki/File:RCStream_example.png [23:41:28] Seems so [23:41:29] !log deployed patch for T105413 [23:41:33] Logged the message, Master [23:43:01] Krinkle: https://wikitech.wikimedia.org/wiki/File:Male_human_head_louse.jpg worked ... [23:43:35] bblack: redirects are disabled now if the request has an OAuth header so if such requests never hit the cache, there is no problem [23:43:50] well, right, ok [23:44:09] bblack: weird [23:44:41] https://wikitech.wikimedia.org/wiki/File:Police_station_in_Dokkum.jpg [23:44:42] also works [23:45:18] what version is wikitech on? [23:45:30] only wmf13 is deployed at the moment [23:45:59] speaking of deployments, I can't log into tin [23:46:04] but wanted to sync a js file :/ [23:47:18] ssh: connect to host bast1001.wikimedia.org port 22: Network is unreachable [23:47:48] Try one of the other bastions [23:48:00] it works for me [23:48:01] You should have access to hooft and the one in codfw (I think) [23:48:22] I don't the codfw is ops only? [23:48:29] I think the* [23:48:33] regarding that last comment see https://gerrit.wikimedia.org/r/#/c/222519/ [23:48:41] mutante wanted to change that AFAIR [23:48:43] and https://gerrit.wikimedia.org/r/#/c/222522/ [23:49:20] ah yes, I can ssh via hooft, ok [23:49:30] currently as a deployer you have hooft, yes [23:49:43] but i think "bastiononly" should also give hooft [23:49:50] wonder what was up with bast1001 [23:49:50] 6operations, 6Multimedia, 6Performance-Team, 10Wikimedia-Site-requests: Please offer larger image thumbnail sizes in Special:Preferences - https://phabricator.wikimedia.org/T65440#1443620 (10Bawolff) Isnt that the point of that extra attribute with the 2x sizes, we already serve? [23:50:55] well part of the problems here is we have no "iron" for esams, either [23:50:56] Is Gerrit not doing the auto-submodule update now? [23:51:25] Krenair: it was two hours ago [23:51:33] what was two hours ago? [23:51:35] oh [23:51:40] gerrit was doing it two hours ago [23:51:43] but then again, does the iron-bastion distinction make sense now that forwarding is disabled? 
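For reference on bblack's three-point summary above: the stock Varnish 3-era builtin vcl_recv ends with roughly the logic below (paraphrased and abridged from the shipped default.vcl; this is not the WMF text-cluster VCL). A custom vcl_recv that makes its own return (lookup) / return (pass) decisions never falls through to this builtin code, so an equivalent Authorization guard has to be added explicitly, which is the "switch back to correct behavior" patch being discussed.

    sub vcl_recv {
        # ... method and X-Forwarded-For handling omitted ...
        if (req.http.Authorization || req.http.Cookie) {
            /* Not cacheable by default */
            return (pass);
        }
        return (lookup);
    }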
[23:52:04] So why did this not work: https://gerrit.wikimedia.org/r/#/c/223983/ [23:52:18] bblack: yes, i think we want to setup "iron" in codfw [23:52:30] and then open the regular bast* [23:52:43] well, unless we give up on the distinction [23:52:59] I can't get to gerrit from my laptop now either [23:54:12] mutante: I suspect we can give up the distinction for ssh purposes, but iron's used for things other than "be a bastion" too, which kinda goes back the other way [23:55:08] in any case, the iron/bast1001 split was never quite right anyways. it probably should've been exclusive instead of just allowed (as in, you're not allowed to forward a root-capable key through bast1001) [23:55:12] not that any of that matters now [23:56:08] bblack: i think bastion should really mean bastion as in "not used for things" and then iron should be the "ops work host" or something .. but _behind_ a universal bastion [23:57:10] (03PS6) 10Negative24: Phabricator: Create differential puppet role [puppet] - 10https://gerrit.wikimedia.org/r/222987 (https://phabricator.wikimedia.org/T104827) [23:57:53] citation - WP article on bastion host "generally hosts a single application, for example a proxy server, and all other services are removed or limited to reduce the threat " [23:57:55] I think my ISP must be having issues or something. [23:58:26] I'm having problems connecting to wikitech, phabricator, gerrit, gitblit, tin [23:58:53] or at least, bast1001 [23:59:14] Krenair: traceroute? [23:59:17] mutante: probably right [23:59:47] let's move this to -security, re: bast1001 + Krenair's problems