[00:05:50] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1966 bytes in 0.132 second response time
[00:11:36] byron: :) yes, happy to try that again and remove the travel@ alias. right now?
[00:12:37] Operations, Mail: move travel related aliases to OIT - https://phabricator.wikimedia.org/T127549#4152035 (Dzahn) @bbogaert Yes, happy to try that again. I can remove it right now if you want. I am also on IRC. - Thanks, Daniel
[00:14:53] mutante: sure
[00:15:23] mutante: travel@ and travelapprovals@
[00:15:30] byron: ok, one sec!
[00:18:44] !log removing travel@ and travelapproval@ exim aliases, moving to OIT/Google (T127549)
[00:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:50] T127549: move travel related aliases to OIT - https://phabricator.wikimedia.org/T127549
[00:18:50] byron: removed on mx1001 and mx2001
[00:20:13] travel@wikimedia.org
[00:20:14] router = ldap_group, transport = remote_smtp
[00:20:16] Ok, cool
[00:20:19] lgtm
[00:20:51] same on both mail servers, it says to me it's Google now
[00:21:38] Ok, cool
[00:21:52] Does not seem like we ran into the cache callout this time?
[00:21:56] Operations, Mail: move travel related aliases to OIT - https://phabricator.wikimedia.org/T127549#4152043 (Dzahn) ``` [mx1001:~] $ sudo exim4 -bt travel@wikimedia.org travel@wikimedia.org router = ldap_group, transport = remote_smtp host aspmx.l.google.com [173.194.204.27] [mx2001:~] $ sudo exim4...
[00:22:35] byron: have you sent a mail to it? can you see it in the google group?
[00:23:18] Yes, that worked
[00:24:19] Thanks mutante!!
[00:25:14] byron: great:) i also sent one. glad to see this resolved :)
[00:25:23] one of the very few that kept that large tracking ticket open
[00:25:28] thanks
[00:25:37] :|
[00:26:33] Operations, Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#4152052 (Dzahn)
[00:26:35] Operations, Mail: move travel related aliases to OIT - https://phabricator.wikimedia.org/T127549#4152049 (Dzahn) Open>Resolved a:Dzahn 20:22 < mutante> byron: have you sent a mail to it? can you see it in the google group? 20:23 < byron> Yes, that worked
[00:26:44] byron: there is only a single one left :)
[00:26:50] legal-tm-vio@
[00:26:56] Ok, i'm doing it
[00:26:58] I'm sorry!
[00:27:58] no worries at all, thank you:)
[00:30:12] byron: i see there is one group inside another group though .. hope that isn't an issue
[00:43:54] mutante: Nope, should not be a problem
[00:44:08] mutante: I just added legal-tm-vio
[01:01:18] byron: :) cool, removing!
[01:01:26] sorry for that delay, had to move my car
[01:23:37] Operations, Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#4152074 (Dzahn)
[01:23:41] Operations, Mail, Office-IT, WMF-Legal: move legal-tm-vio alias to OIT - https://phabricator.wikimedia.org/T170365#4152071 (Dzahn) Open>Resolved a:Dzahn 20:44 < byron> mutante: I just added legal-tm-vio [mx2001:~] $ sudo exim4 -bt legal-tm-vio@wikimedia.org legal-tm-vio@wikimedia.org...
[01:27:12] Operations, Design-Research-Archive: Edit optoutresearch@ mailing list recipients - https://phabricator.wikimedia.org/T100860#4152076 (Dzahn) Resolved>Open Does anyone know if this is still used nowadays and whether it should be changed or removed?
[01:29:14] Operations, Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#4152079 (Dzahn) all subtasks are resolved now. yay! all remaining things are either "ops" internal or technical (like techcom@ , packagist-admin@, analytics-alerts@ ) (maybe one remaining questio...
[01:29:24] Operations, Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#4152081 (Dzahn) Open>Resolved a:Dzahn
[02:11:57] (PS1) Bstorm: wiki replicas: index script should be able to operate on one DB [puppet] - https://gerrit.wikimedia.org/r/428550
[02:32:55] (PS1) Samwilson: Deploy GlobalPreferences to test wikis and mw.org (third time) [mediawiki-config] - https://gerrit.wikimedia.org/r/428554
[02:55:05] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.30) (duration: 10m 37s)
[02:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:02:54] Operations, Design-Research-Archive: Edit optoutresearch@ mailing list recipients - https://phabricator.wikimedia.org/T100860#4152163 (Dzahn) @bbogaert Any thoughts on this ? Should we move it or remove it maybe? Take a look at the people on it, not sure who is still working at WMF. +optoutresearch: ar...
[03:08:50] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1965 bytes in 0.105 second response time
[03:08:54] !log reinstalling mw2224.codfw.wmnet with wmf-auto-reimage
[03:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:12:38] (CR) Krinkle: [C: 1] Don't try to set wgSiteSupportPage, ignored for a decade [mediawiki-config] - https://gerrit.wikimedia.org/r/428365 (https://phabricator.wikimedia.org/T192467) (owner: Jforrester)
[03:18:50] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1951 bytes in 0.100 second response time
[04:35:44] !log repooled mw2224, reinstalling mw2225 through mw2228
[04:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:45:44] (CR) Giuseppe Lavagetto: [C: 2] Make envoy build work when operating behind a proxy [docker-images/production-images] - https://gerrit.wikimedia.org/r/424857 (owner:
Giuseppe Lavagetto)
[04:45:51] (CR) Giuseppe Lavagetto: [V: 2 C: 2] Make envoy build work when operating behind a proxy [docker-images/production-images] - https://gerrit.wikimedia.org/r/424857 (owner: Giuseppe Lavagetto)
[05:03:19] <_joe_> !log rebuilding the docker base images
[05:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:00] Operations, MediaWiki-Platform-Team, HHVM, TechCom-RFC (TechCom-Approved), User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#4152206 (Joe)
[05:15:48] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/428557
[05:17:27] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/428557 (owner: Marostegui)
[05:18:42] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/428557 (owner: Marostegui)
[05:19:18] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/428557 (owner: Marostegui)
[05:21:12] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1096:3316 after alter table (duration: 00m 59s)
[05:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:43] https://commons.wikimedia.org/wiki/File:Nouveau_Larousse_illustr%C3%A9,_1898,_IV.djvu
[05:27:57] 0 × 0 pixels, file size: 125.64 MB ^
[05:28:05] how is that possible?
[05:28:28] also https://commons.wikimedia.org/wiki/File:Nouveau_Larousse_illustr%C3%A9,_1898,_V.djvu
[05:28:38] and https://commons.wikimedia.org/wiki/File:Nouveau_Larousse_illustr%C3%A9,_1898,_VI.djvu
[05:30:19] may be related to https://phabricator.wikimedia.org/T142939
[05:30:31] should I add to this report or open a new one?
[05:33:38] yannf: I'd open a new one, referencing the old in the description :)
[05:33:49] ok
[05:36:41] ok, done https://phabricator.wikimedia.org/T192866
[05:43:22] thanks!
[05:51:24] (PS1) Elukey: reportupdater: create logfile after rotation [puppet] - https://gerrit.wikimedia.org/r/428561 (https://phabricator.wikimedia.org/T191871)
[05:52:36] (CR) Elukey: [C: 2] reportupdater: create logfile after rotation [puppet] - https://gerrit.wikimedia.org/r/428561 (https://phabricator.wikimedia.org/T191871) (owner: Elukey)
[06:29:01] PROBLEM - puppet last run on es1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:30:25] (PS1) Marostegui: db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/428562 (https://phabricator.wikimedia.org/T190148)
[06:32:23] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/428562 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[06:33:41] (Merged) jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/428562 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[06:34:01] RECOVERY - puppet last run on es1017 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:34:57] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1093 for alter table (duration: 00m 59s)
[06:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:07] !log Deploy schema change on db1093 - T191519 T188299 T190148
[06:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:14] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[06:35:14] T190148: Change DEFAULT 0 for rev_text_id on production DBs -
https://phabricator.wikimedia.org/T190148 [06:35:14] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [06:39:25] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428562 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [06:40:53] (03PS2) 10Muehlenhoff: Remove mediawiki::firejail [puppet] - 10https://gerrit.wikimedia.org/r/428382 [06:41:51] !log installing poppler security updates [06:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:21] (03PS1) 10Muehlenhoff: Update Cumin alias for maps [puppet] - 10https://gerrit.wikimedia.org/r/428564 [06:50:13] (03CR) 10Muehlenhoff: [C: 032] Update Cumin alias for maps [puppet] - 10https://gerrit.wikimedia.org/r/428564 (owner: 10Muehlenhoff) [06:56:19] !log restart zookeeper on conf200[123] for openjdk upgrades [06:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:58] (03CR) 10Gilles: [C: 031] Remove mediawiki::firejail [puppet] - 10https://gerrit.wikimedia.org/r/428382 (owner: 10Muehlenhoff) [07:10:29] 10Operations, 10hardware-requests: request to assign WMF3565 as terbium equivalent - https://phabricator.wikimedia.org/T192185#4131143 (10MoritzMuehlenhoff) >>! In T192185#4131159, @Dzahn wrote: > The name would be **nihonium**, element 113. I'd rather use a functional name here, e.g. mwmaint1001.eqiad.wmnet.... [07:13:35] 10Operations, 10hardware-requests: request to assign WMF3565 as terbium equivalent - https://phabricator.wikimedia.org/T192185#4152337 (10MoritzMuehlenhoff) >>! In T192185#4138132, @faidon wrote: > WMF3565 is > 5 years old, so there's really no point in setting hardware that old right now. > > How urgent is t... 
[07:16:51] !log Update puppet compiler facts [07:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:43] (03PS1) 10Hoo man: Set default for $wmgWikibaseSiteGroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428565 (https://phabricator.wikimedia.org/T188456) [07:19:14] (03PS2) 10Marostegui: wiki replicas: Depool labsdb1011 for MCR table additions [puppet] - 10https://gerrit.wikimedia.org/r/428361 (https://phabricator.wikimedia.org/T184446) (owner: 10Bstorm) [07:22:45] (03CR) 10Marostegui: [C: 032] wiki replicas: Depool labsdb1011 for MCR table additions [puppet] - 10https://gerrit.wikimedia.org/r/428361 (https://phabricator.wikimedia.org/T184446) (owner: 10Bstorm) [07:23:43] !log Reload haproxy on dbproxy1010 to depool labsdb1011 - T184446 [07:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:49] T184446: Configure Toolforge replica views and dumps for the new MCR tables - https://phabricator.wikimedia.org/T184446 [07:28:26] !log restarting blazegraph on wdqs1004 for jvm upgrade [07:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:38] (03CR) 10Vgutierrez: [C: 032] "pcc happy showing noop: https://puppet-compiler.wmflabs.org/compiler02/11010/" [puppet] - 10https://gerrit.wikimedia.org/r/428422 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez) [07:30:44] (03PS2) 10Vgutierrez: hieradata: clean-up eqsin lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/428422 (https://phabricator.wikimedia.org/T191897) [07:33:45] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to add Matthias Geisler to the ldap/wmde group - https://phabricator.wikimedia.org/T191523#4152367 (10Matthias_Geisler_WMDE) My official address is: matthias.geisler@wikimedia.de if I am not in the office and it's urgent you can ping me at: g... 
[07:39:29] !log Rename user_old and user_temp tables on db1077 - T172664 [07:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:09] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324#4152376 (10Gilles) [07:41:14] 10Operations, 10Performance-Team, 10Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4152374 (10Gilles) 05Open>03Resolved Since I'm very confident about the fix, I'm going to close this task, feel free to reopen if the issue reocc... [07:50:32] !log Started running populateSitesTable.php for all wikis (T192628, T192632, T192631, T192633) [07:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:42] T192633: Please add Wikidata support for euwikisource - https://phabricator.wikimedia.org/T192633 [07:50:42] T192631: Add Wikidata support for gorwiki - https://phabricator.wikimedia.org/T192631 [07:50:42] T192632: Please add Wikidata support for inhwiki - https://phabricator.wikimedia.org/T192632 [07:50:42] T192628: Add Wikidata support for 'lfnwiki' - https://phabricator.wikimedia.org/T192628 [07:52:20] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0 [07:52:50] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0 [07:56:50] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 [07:57:20] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 76, down: 0, dormant: 0, excluded: 0, unused: 0 [08:05:03] !log power off restbase1010 for ssd replacement - T189822 [08:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[08:05:09] T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822 [08:10:40] 10Operations, 10ops-eqiad, 10Cassandra, 10hardware-requests, and 2 others: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4152450 (10fgiunchedi) a:05RobH>03Cmjohnson @Cmjohnson restbase1010 is powered down and ready to have all of its ssd swapped [08:14:02] !log upload druid_0.10.0-3~jessie1 (collection of druid packages) to jessie-wikimedia - T164008 [08:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:07] T164008: Update druid to 0.10 - https://phabricator.wikimedia.org/T164008 [08:16:57] (03PS1) 10Jcrespo: base: Add disable atop functionality and test it on dbtore hosts [puppet] - 10https://gerrit.wikimedia.org/r/428571 (https://phabricator.wikimedia.org/T192551) [08:17:40] !log Finished running populateSitesTable.php for all wikis (T192628, T192632, T192631, T192633) [08:17:40] (03CR) 10jerkins-bot: [V: 04-1] base: Add disable atop functionality and test it on dbtore hosts [puppet] - 10https://gerrit.wikimedia.org/r/428571 (https://phabricator.wikimedia.org/T192551) (owner: 10Jcrespo) [08:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:48] T192633: Please add Wikidata support for euwikisource - https://phabricator.wikimedia.org/T192633 [08:17:48] T192631: Add Wikidata support for gorwiki - https://phabricator.wikimedia.org/T192631 [08:17:48] T192632: Please add Wikidata support for inhwiki - https://phabricator.wikimedia.org/T192632 [08:17:48] T192628: Add Wikidata support for 'lfnwiki' - https://phabricator.wikimedia.org/T192628 [08:22:55] (03PS2) 10Jcrespo: base: Add disable atop functionality and test it on dbtore hosts [puppet] - 10https://gerrit.wikimedia.org/r/428571 (https://phabricator.wikimedia.org/T192551) [08:22:57] (03PS1) 10Filippo Giunchedi: netops: add asw2-a-eqiad and 
asw2-c-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/428572 (https://phabricator.wikimedia.org/T191896) [08:23:58] (03PS3) 10Jcrespo: base: Add disable atop functionality and test it on dbtore hosts [puppet] - 10https://gerrit.wikimedia.org/r/428571 (https://phabricator.wikimedia.org/T192551) [08:24:15] (03PS2) 10Filippo Giunchedi: netops: add asw2-a-eqiad and asw2-c-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/428572 (https://phabricator.wikimedia.org/T187960) [08:27:13] (03PS4) 10Jcrespo: base: Add disable atop functionality and test it on dbstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/428571 (https://phabricator.wikimedia.org/T192551) [08:28:33] (03CR) 10Jcrespo: [C: 032] base: Add disable atop functionality and test it on dbstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/428571 (https://phabricator.wikimedia.org/T192551) (owner: 10Jcrespo) [08:36:43] (03PS1) 10Jcrespo: atop: Disable atop on core&dbstore roles to test jessie/trusty [puppet] - 10https://gerrit.wikimedia.org/r/428574 (https://phabricator.wikimedia.org/T192551) [08:37:52] (03CR) 10Jcrespo: [C: 032] atop: Disable atop on core&dbstore roles to test jessie/trusty [puppet] - 10https://gerrit.wikimedia.org/r/428574 (https://phabricator.wikimedia.org/T192551) (owner: 10Jcrespo) [08:41:39] 10Operations, 10monitoring, 10Patch-For-Review: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4152541 (10Marostegui) Bug submited to Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=896767 [08:45:27] akosiaris: when you get a second, to fix icinga configuration errors: https://gerrit.wikimedia.org/r/c/428572/ [08:47:24] 10Operations, 10ops-eqiad, 10Cassandra, 10hardware-requests, and 3 others: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4152564 (10fgiunchedi) [08:50:11] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/428572 
(https://phabricator.wikimedia.org/T187960) (owner: 10Filippo Giunchedi) [08:50:24] volans: thanks! [08:50:36] (03PS1) 10Elukey: Set PXE boot to Debian Stretch for kafka[12]00[123] [puppet] - 10https://gerrit.wikimedia.org/r/428575 (https://phabricator.wikimedia.org/T192832) [08:50:41] thank you for the fix ;) [08:50:56] (03PS3) 10Filippo Giunchedi: netops: add asw2-a-eqiad and asw2-c-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/428572 (https://phabricator.wikimedia.org/T187960) [08:52:39] (03CR) 10Filippo Giunchedi: [C: 032] netops: add asw2-a-eqiad and asw2-c-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/428572 (https://phabricator.wikimedia.org/T187960) (owner: 10Filippo Giunchedi) [08:55:30] (03PS1) 10ArielGlenn: use default installer for snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/428577 (https://phabricator.wikimedia.org/T161509) [08:56:22] we're back [08:56:23] Things look okay - No serious problems were detected during the pre-flight check [08:56:44] (03PS1) 10Jcrespo: base: Disable atop daemon everywhere [puppet] - 10https://gerrit.wikimedia.org/r/428579 (https://phabricator.wikimedia.org/T192551) [08:59:30] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct [08:59:45] (03CR) 10Jcrespo: "Ariel: I think you were a user of atop- feel free to speak up- as you can see it is very easy to reenable it for selected hosts or do some" [puppet] - 10https://gerrit.wikimedia.org/r/428579 (https://phabricator.wikimedia.org/T192551) (owner: 10Jcrespo) [08:59:57] (03CR) 10ArielGlenn: [C: 032] use default installer for snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/428577 (https://phabricator.wikimedia.org/T161509) (owner: 10ArielGlenn) [09:00:07] (03PS2) 10ArielGlenn: use default installer for snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/428577 (https://phabricator.wikimedia.org/T161509) [09:01:48] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: 
rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4152599 (10Vgutierrez) There is some stuff still missing, like setting the management password, if that's handled I think I can continue from there. [09:01:50] \o/ [09:02:14] (03PS1) 10Ema: VCL: only parse X-Connection-Properties if available [puppet] - 10https://gerrit.wikimedia.org/r/428580 [09:02:16] (03PS1) 10Elukey: Move static ipv6 configuration for analytics hosts to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/428581 [09:02:28] 10Operations, 10monitoring, 10Patch-For-Review, 10Upstream: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4152600 (10jcrespo) I've sent a notice upstream: https://github.com/Atoptool/atop/issues/27 [09:03:51] !log mobrovac@tin Started deploy [restbase/deploy@1661f69]: Increase the deletion probability to 50% and expose the CSS end points - T192689 T190846 [09:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:59] T192689: Unchecked storage growth(?) 
- https://phabricator.wikimedia.org/T192689 [09:04:00] T190846: CSS endpoint public rollout - https://phabricator.wikimedia.org/T190846 [09:04:30] PROBLEM - MD RAID on ms-be1043 is CRITICAL: CRITICAL: State: degraded, Active: 6, Working: 6, Failed: 0, Spare: 0 [09:04:31] ACKNOWLEDGEMENT - MD RAID on ms-be1043 is CRITICAL: CRITICAL: State: degraded, Active: 6, Working: 6, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T192874 [09:04:36] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1043 - https://phabricator.wikimedia.org/T192874#4152610 (10ops-monitoring-bot) [09:07:11] (03PS1) 10Elukey: Update Gemfile after puppet-lint-wmf_styleguide-check update [puppet] - 10https://gerrit.wikimedia.org/r/428582 [09:07:37] (03CR) 10jerkins-bot: [V: 04-1] Update Gemfile after puppet-lint-wmf_styleguide-check update [puppet] - 10https://gerrit.wikimedia.org/r/428582 (owner: 10Elukey) [09:10:42] (03CR) 10ArielGlenn: "Jaime: that's fine for me. Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/428579 (https://phabricator.wikimedia.org/T192551) (owner: 10Jcrespo) [09:12:41] !log reimaging mw1273, mw1274, mw1275 (app servers) to stretch [09:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:23] Hi, just a question: is SWAT appropriate for backports in REL1_31? [09:15:41] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428583 [09:17:00] Pinging greg-g as suggested on wikitech [09:17:04] !log mobrovac@tin Finished deploy [restbase/deploy@1661f69]: Increase the deletion probability to 50% and expose the CSS end points - T192689 T190846 (duration: 13m 13s) [09:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:11] T192689: Unchecked storage growth(?) 
- https://phabricator.wikimedia.org/T192689 [09:17:11] T190846: CSS endpoint public rollout - https://phabricator.wikimedia.org/T190846 [09:17:26] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428583 (owner: 10Marostegui) [09:17:50] (03PS1) 10DCausse: Bump extra plugin version to 5.5.2.4 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/428584 (https://phabricator.wikimedia.org/T191543) [09:18:07] !log mobrovac@tin Started deploy [restbase/deploy@1661f69]: Increase the deletion probability to 50% and expose the CSS end points, take #2 - T192689 T190846 [09:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:42] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428583 (owner: 10Marostegui) [09:18:47] (03CR) 10DCausse: "the sha256sums diff is a bit unreadable but sorting it it looks like:" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/428584 (https://phabricator.wikimedia.org/T191543) (owner: 10DCausse) [09:19:25] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428583 (owner: 10Marostegui) [09:20:02] (03CR) 10Alexandros Kosiaris: "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/428572 (https://phabricator.wikimedia.org/T187960) (owner: 10Filippo Giunchedi) [09:20:17] Daimona: I think with the actual release branches (RELX_XX) anyone can cherry pick patches in, I think the swat page talks about the wmfX.X release branches [09:20:51] but chad and sam are probably good people to ping [09:21:01] (03PS2) 10Ema: VCL: only parse X-Connection-Properties if available [puppet] - 10https://gerrit.wikimedia.org/r/428580 [09:21:09] !log mobrovac@tin Finished deploy [restbase/deploy@1661f69]: Increase the deletion probability to 50% and expose the CSS end points, take #2 - T192689 T190846 (duration: 
03m 03s) [09:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:18] !log reimaging mw1221, mw1222, mw1223 (app servers) to stretch [09:21:19] !log mobrovac@tin Started deploy [restbase/deploy@1661f69]: Increase the deletion probability to 50% and expose the CSS end points, take #3 - T192689 T190846 [09:21:21] p858snake: Indeed, I had this doubt after adding patches for today's SWAT. I don't want to take away 3 slots for stuff that can't be done :-) [09:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:26] !log reimaging mw1221, mw1222, mw1223 (API servers) to stretch [09:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:08] Daimona: SWAT is only for the versions currently running on the wikis in WMF production, so wmfX.Y, not for REL branches [09:22:10] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/data/css/mobile/base (Get base CSS) is CRITICAL: Test Get base CSS returned the unexpected status 404 (expecting: 200) [09:22:17] known ^ [09:23:03] Fine, thanks mobrovac, I'll remove my commits [09:23:43] (03PS11) 10Volans: First working version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) [09:23:45] (03PS10) 10Volans: Add CLI script to be installed in the target hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504) [09:23:47] (03PS12) 10Volans: Add basic test coverage [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) [09:23:49] (03PS8) 10Volans: Add login and LDAP support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) [09:23:51] (03PS3) 10Volans: Add server side validation of client 
certificates [software/debmonitor] - 10https://gerrit.wikimedia.org/r/428302 (https://phabricator.wikimedia.org/T167504) [09:23:53] (03PS1) 10Marostegui: db-eqiad.php: Depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428585 (https://phabricator.wikimedia.org/T190148) [09:23:55] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1093 after alter table (duration: 03m 06s) [09:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:04] (03CR) 10jerkins-bot: [V: 04-1] First working version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [09:24:06] (03CR) 10jerkins-bot: [V: 04-1] Add CLI script to be installed in the target hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [09:24:08] (03CR) 10jerkins-bot: [V: 04-1] Add basic test coverage [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [09:24:10] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [09:24:10] (03CR) 10jerkins-bot: [V: 04-1] Add server side validation of client certificates [software/debmonitor] - 10https://gerrit.wikimedia.org/r/428302 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [09:24:12] (03CR) 10jerkins-bot: [V: 04-1] Add login and LDAP support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [09:25:28] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428585 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [09:25:49] !log mobrovac@tin Finished deploy [restbase/deploy@1661f69]: Increase the deletion probability to 50% and expose the CSS end points, take #3 - T192689 T190846 (duration: 04m 
30s) [09:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:56] T192689: Unchecked storage growth(?) - https://phabricator.wikimedia.org/T192689 [09:25:56] T190846: CSS endpoint public rollout - https://phabricator.wikimedia.org/T190846 [09:26:48] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428585 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [09:27:59] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1088 for alter table (duration: 00m 58s) [09:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:12] !log Deploy schema change on db1088 - T191519 T188299 T190148 [09:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:19] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [09:28:19] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [09:28:20] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [09:31:37] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428585 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [09:32:34] (03PS2) 10Jcrespo: mariadb: Depool db1110 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428285 [09:33:56] (03CR) 10Gehel: "Would it make sense to add a "sort" to `check_sha256`?" 
[software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/428584 (https://phabricator.wikimedia.org/T191543) (owner: 10DCausse) [09:35:32] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1110 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428285 (owner: 10Jcrespo) [09:35:48] (03PS2) 10Jcrespo: base: Disable atop daemon everywhere [puppet] - 10https://gerrit.wikimedia.org/r/428579 (https://phabricator.wikimedia.org/T192551) [09:35:50] (03PS1) 10Jcrespo: install_server: Temporarily allow the reimage of db11** hosts [puppet] - 10https://gerrit.wikimedia.org/r/428586 [09:36:50] (03CR) 10Jcrespo: [C: 032] install_server: Temporarily allow the reimage of db11** hosts [puppet] - 10https://gerrit.wikimedia.org/r/428586 (owner: 10Jcrespo) [09:36:52] (03Merged) 10jenkins-bot: mariadb: Depool db1110 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428285 (owner: 10Jcrespo) [09:36:58] (03PS2) 10Jcrespo: install_server: Temporarily allow the reimage of db11** hosts [puppet] - 10https://gerrit.wikimedia.org/r/428586 [09:37:19] (03CR) 10jenkins-bot: mariadb: Depool db1110 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428285 (owner: 10Jcrespo) [09:37:40] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11011/" [puppet] - 10https://gerrit.wikimedia.org/r/428581 (owner: 10Elukey) [09:38:09] (03PS2) 10Elukey: Move static ipv6 configuration for analytics hosts to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/428581 [09:39:02] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1110 (duration: 00m 58s) [09:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:13] snapshot1001.eqiad.wmnet returned [255]: Host key verification failed [09:39:23] apergos^ [09:39:30] (03CR) 10Elukey: [C: 032] Move static ipv6 configuration for analytics hosts to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/428581 (owner: 10Elukey) [09:39:34] known? 
[09:39:36] it's being reimaged [09:39:47] ok [09:39:50] via the wmf reimaging script [09:40:04] I guess after every reimage, pull is done? [09:40:23] I hope so, it's certainly done so in the past [09:40:28] :-) [09:40:38] it is better to run a scap pull before pooling [09:40:41] but i can't remember if it might not finish because mw repo is so large these days [09:40:43] just to avoid any issue [09:41:09] (03PS3) 10Jcrespo: install_server: Temporarily allow the reimage of db11** hosts [puppet] - 10https://gerrit.wikimedia.org/r/428586 [09:41:31] yep, this is a testbed host anyways so there is no 'pool', even the regular snapshot hosts have no 'pool' as such [09:41:40] (03CR) 10Gehel: [C: 031] "LGTM" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/428584 (https://phabricator.wikimedia.org/T191543) (owner: 10DCausse) [09:42:03] yeah, but you still want the latest db config :) [09:42:21] jynus: shall I merge? [09:42:28] I was going to ask the same [09:42:39] go ahead :) [09:42:44] (03CR) 10Gehel: [C: 032] Bump extra plugin version to 5.5.2.4 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/428584 (https://phabricator.wikimedia.org/T191543) (owner: 10DCausse) [09:42:56] (03CR) 10Gehel: [V: 032 C: 032] Bump extra plugin version to 5.5.2.4 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/428584 (https://phabricator.wikimedia.org/T191543) (owner: 10DCausse) [09:43:31] yes I will want it, indeed [09:44:27] during the first puppet run the latest mw repo should be deployed [09:47:53] (03PS1) 10Gehel: copper has been replaced by boron [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/428589 [09:48:17] (03PS1) 10Jcrespo: install_server: Reimage db1110, db1111 and db1112 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/428591 [09:49:02] (03CR) 10DCausse: [V: 032 C: 032] copper has been replaced by boron [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/428589 (owner: 10Gehel)
[09:55:06] (03PS10) 10Muehlenhoff: mediawiki: Add explicit dependency on ghostscript [puppet] - 10https://gerrit.wikimedia.org/r/313963 [09:55:55] PROBLEM - Nginx local proxy to apache on mw1221 is CRITICAL: connect to address 10.64.48.56 and port 443: Connection refused [09:55:56] PROBLEM - Nginx local proxy to apache on mw1273 is CRITICAL: connect to address 10.64.0.68 and port 443: Connection refused [09:55:56] PROBLEM - Check systemd state on mw1221 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:55:56] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1222 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:55:56] PROBLEM - Check whether ferm is active by checking the default input chain on mw1223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:55:56] PROBLEM - configured eth on mw1223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:55:56] PROBLEM - Check systemd state on mw1273 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:55:57] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1274 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:55:57] PROBLEM - Check whether ferm is active by checking the default input chain on mw1275 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:55:58] PROBLEM - configured eth on mw1275 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:56:06] ^ reimage spam, silencing [09:57:35] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1221 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:57:35] PROBLEM - Check whether ferm is active by checking the default input chain on mw1222 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:57:36] PROBLEM - configured eth on mw1222 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[09:57:36] PROBLEM - DPKG on mw1223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:57:36] PROBLEM - dhclient process on mw1223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [10:00:10] (03CR) 10Muehlenhoff: [C: 032] mediawiki: Add explicit dependency on ghostscript [puppet] - 10https://gerrit.wikimedia.org/r/313963 (owner: 10Muehlenhoff) [10:00:29] (03PS3) 10Ema: VCL: only parse X-Connection-Properties if available [puppet] - 10https://gerrit.wikimedia.org/r/428580 [10:00:31] (03PS1) 10Ema: VCL: 400 on empty/unparseable Host header values [puppet] - 10https://gerrit.wikimedia.org/r/428594 [10:01:06] !log reimaged snapshot1001 for testing with php7/stretch [10:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:36] RECOVERY - Check whether ferm is active by checking the default input chain on mw1222 is OK: OK ferm input default policy is set [10:07:36] RECOVERY - configured eth on mw1222 is OK: OK - interfaces up [10:07:50] (03PS2) 10Ema: VCL: 400 on empty/unparseable Host header values [puppet] - 10https://gerrit.wikimedia.org/r/428594 [10:09:45] RECOVERY - dhclient process on mw1223 is OK: PROCS OK: 0 processes with command name dhclient [10:09:45] RECOVERY - DPKG on mw1223 is OK: All packages OK [10:09:56] (03PS4) 10Ema: VCL: only parse X-Connection-Properties if available [puppet] - 10https://gerrit.wikimedia.org/r/428580 [10:09:58] (03PS3) 10Ema: VCL: 400 on empty/unparseable Host header values [puppet] - 10https://gerrit.wikimedia.org/r/428594 [10:10:05] RECOVERY - Check whether ferm is active by checking the default input chain on mw1223 is OK: OK ferm input default policy is set [10:10:05] RECOVERY - configured eth on mw1223 is OK: OK - interfaces up [10:11:05] RECOVERY - Nginx local proxy to apache on mw1221 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 2.519 second response time [10:12:55] (03CR) 10Jcrespo: [C: 032] install_server: Reimage db1110, db1111 and 
db1112 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/428591 (owner: 10Jcrespo) [10:13:02] (03PS2) 10Jcrespo: install_server: Reimage db1110, db1111 and db1112 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/428591 [10:13:22] (03PS4) 10Ema: VCL: 400 on empty/unparseable Host header values [puppet] - 10https://gerrit.wikimedia.org/r/428594 [10:14:19] (03CR) 10Alexandros Kosiaris: "minor inline comment, rest LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/427654 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [10:16:13] RECOVERY - Check systemd state on mw1221 is OK: OK - running: The system is fully operational [10:18:18] (03PS5) 10Ema: VCL: 400 on empty/unparseable Host header values [puppet] - 10https://gerrit.wikimedia.org/r/428594 [10:23:23] 10Operations, 10hardware-requests: request to assign WMF3565 as terbium equivalent - https://phabricator.wikimedia.org/T192185#4152830 (10Joe) >>! In T192185#4152337, @MoritzMuehlenhoff wrote: >>>! In T192185#4138132, @faidon wrote: >> WMF3565 is > 5 years old, so there's really no point in setting hardware th... [10:24:50] (03PS1) 10ArielGlenn: make addschanges dumps use configured php on snaphot hosts [puppet] - 10https://gerrit.wikimedia.org/r/428595 (https://phabricator.wikimedia.org/T181029) [10:25:58] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1222 is OK: OK: synced at Tue 2018-04-24 10:25:50 UTC. [10:27:38] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1221 is OK: OK: synced at Tue 2018-04-24 10:27:30 UTC. [10:29:04] (03CR) 10ArielGlenn: [C: 032] make addschanges dumps use configured php on snaphot hosts [puppet] - 10https://gerrit.wikimedia.org/r/428595 (https://phabricator.wikimedia.org/T181029) (owner: 10ArielGlenn) [10:29:37] (03CR) 10Ema: [C: 04-1] "We need to ensure that all check_http frontend checks use -H instead of -I alone before merging this." 
[puppet] - 10https://gerrit.wikimedia.org/r/428594 (owner: 10Ema) [10:29:51] (03PS1) 10Urbanecm: New throttle rule for IndigenizeWikipedia event, clean obsolete rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428597 (https://phabricator.wikimedia.org/T192827) [10:31:35] !log stop and reimage db1110 [10:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:35] RECOVERY - Check systemd state on mw1273 is OK: OK - running: The system is fully operational [10:32:37] (03PS3) 10MarcoAurelio: Grant Meta-Wiki sysops the ability to edit global abusefilter rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428290 (https://phabricator.wikimedia.org/T192722) [10:33:16] RECOVERY - Nginx local proxy to apache on mw1273 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.083 second response time [10:35:54] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1110 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428598 [10:35:58] (03PS1) 10ArielGlenn: use php7.0 on snapshot1001 for dumps testing [puppet] - 10https://gerrit.wikimedia.org/r/428599 (https://phabricator.wikimedia.org/T161509) [10:37:00] (03CR) 10ArielGlenn: [C: 032] use php7.0 on snapshot1001 for dumps testing [puppet] - 10https://gerrit.wikimedia.org/r/428599 (https://phabricator.wikimedia.org/T161509) (owner: 10ArielGlenn) [10:37:29] 10Operations: Upgrade qemu on ganeti clusters to 2.8 - https://phabricator.wikimedia.org/T150532#4152951 (10akosiaris) [10:38:07] 10Operations: Upgrade qemu on ganeti clusters to 2.8 - https://phabricator.wikimedia.org/T150532#2788778 (10akosiaris) mwdebug2001 showed no problems, I 'll proceed with upgrading the entire codfw cluster. A full cluster VM reboot is to follow [10:39:14] !log upgrade to qemu 2.8 on codfw ganeti cluster. 
T150532 [10:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:20] T150532: Upgrade qemu on ganeti clusters to 2.8 - https://phabricator.wikimedia.org/T150532 [10:39:29] !log starting a very slow rolling reboot of all VMs on codfw ganeti cluster T150532 [10:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:03] akosiaris: can you ping me before bohrium so I'll gracefully shutdown mysql? [10:41:16] ah that's codfw [10:41:17] elukey: sure. but I am still on codfw :-) [10:41:20] misread :) [10:49:17] 10Operations, 10hardware-requests: request to assign WMF3565 as terbium equivalent - https://phabricator.wikimedia.org/T192185#4153064 (10faidon) Let's just use both of them to also set up the stand-in that you mentioned above? (approved) [10:49:24] !log enable puppet in labtestcontrol2001 to sync with repo changes [10:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:58] !log reimage analytics106[56] to Debian Stretch [10:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:55] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1274 is OK: OK: synced at Tue 2018-04-24 10:55:50 UTC. [10:58:58] !log reimaging mw1224, mw1225, mw1226 (API servers) to stretch [10:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:23] (03CR) 10Steinsplitter: [C: 031] "Looks good to me (rebase needed)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/428290 (https://phabricator.wikimedia.org/T192722) (owner: 10MarcoAurelio) [11:03:40] 10Operations, 10Ops-Access-Requests: Access to Google Search Console, Tag Manager, and Analytics for Go Fish Digital - https://phabricator.wikimedia.org/T192893#4153156 (10Deskana) [11:07:45] (03PS1) 10Jcrespo: mariadb: Repool db1110 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428601 [11:08:26] (03PS1) 10ArielGlenn: turn off nfs file attr caching on snapshot1001 [puppet] - 10https://gerrit.wikimedia.org/r/428602 [11:09:35] (03CR) 10ArielGlenn: [C: 032] turn off nfs file attr caching on snapshot1001 [puppet] - 10https://gerrit.wikimedia.org/r/428602 (owner: 10ArielGlenn) [11:12:21] 10Operations, 10Wikimedia-Extension-setup, 10Wikimedia-Site-requests, 10Collaboration-Feature-Rollouts (Collaboration-Maps), 10Maps (Kartographer): Enable Kartographer on the Bulgarian Wikipedia - https://phabricator.wikimedia.org/T192895#4153202 (10kerberizer) [11:18:32] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:35:43] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [11:39:43] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:44:03] !log reimaging mw1241, mw1242, mw1243 (app servers) to stretch [11:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:33] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1224 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:48:33] PROBLEM - Check whether ferm is active by checking the default input chain on mw1225 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[11:48:34] PROBLEM - configured eth on mw1225 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:48:34] PROBLEM - DPKG on mw1226 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:48:34] PROBLEM - dhclient process on mw1226 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:48:47] ^ reimage spam, silencing [11:50:52] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db1110 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428601 (owner: 10Jcrespo) [11:52:13] (03Merged) 10jenkins-bot: mariadb: Repool db1110 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428601 (owner: 10Jcrespo) [11:52:29] (03CR) 10jenkins-bot: mariadb: Repool db1110 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428601 (owner: 10Jcrespo) [11:54:10] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1110 with low load (duration: 00m 59s) [11:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:59] jouncebot: next [11:56:59] In 1 hour(s) and 3 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180424T1300) [11:57:04] PROBLEM - Host mw1275 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:11] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428607 [11:57:16] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428607 [11:59:06] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428607 (owner: 10Marostegui) [11:59:32] (03PS2) 10Jcrespo: Revert "mariadb: Depool db1110 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428598 [12:00:27] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/428607 (owner: 10Marostegui) [12:01:10] (03PS3) 10Jcrespo: Revert "mariadb: Depool db1110 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428598 [12:01:37] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1088 after alter table (duration: 00m 58s) [12:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:23] (03PS1) 10Marostegui: db-eqiad.php: Depool db1098:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428610 (https://phabricator.wikimedia.org/T190148) [12:04:14] jynus: I will wait for you before merging ^ [12:04:43] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428607 (owner: 10Marostegui) [12:05:59] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [12:07:03] (03CR) 10Ladsgroup: "yeah sure, let me make a patch for that." [puppet] - 10https://gerrit.wikimedia.org/r/428297 (https://phabricator.wikimedia.org/T189596) (owner: 10Ladsgroup) [12:07:26] I cannot merge yet [12:07:40] RECOVERY - Check whether ferm is active by checking the default input chain on mw1225 is OK: OK ferm input default policy is set [12:07:40] RECOVERY - configured eth on mw1225 is OK: OK - interfaces up [12:07:41] it will take some time for the buffer pool to get full [12:07:59] Ah ok, then I will proceed :) [12:08:12] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1098:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428610 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [12:08:49] RECOVERY - dhclient process on mw1226 is OK: PROCS OK: 0 processes with command name dhclient [12:08:49] RECOVERY - DPKG on mw1226 is OK: All packages OK [12:09:28] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1098:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428610 (https://phabricator.wikimedia.org/T190148) (owner: 
10Marostegui) [12:09:59] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:10:01] (03PS1) 10Ladsgroup: mediawiki: stop deleting autopatrol logs temporarily [puppet] - 10https://gerrit.wikimedia.org/r/428613 [12:10:53] !log reimage analytics106[34] to Debian Stretch [12:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1098:3316 for alter table (duration: 00m 58s) [12:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:21] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1098:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428610 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [12:12:09] (03PS5) 10Ema: VCL: only parse X-Connection-Properties if available [puppet] - 10https://gerrit.wikimedia.org/r/428580 [12:12:11] (03PS6) 10Ema: VCL: 400 on empty/unparseable Host header values [puppet] - 10https://gerrit.wikimedia.org/r/428594 [12:13:31] !log Deploy schema change on db1098:3316 - T191519 T188299 T190148 [12:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:38] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [12:13:38] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [12:13:38] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [12:16:48] 10Operations, 10Graphite: Restore Graphite whipser data from April 23th - https://phabricator.wikimedia.org/T192899#4153360 (10Gilles) [12:17:05] 10Operations, 10Graphite: Restore Graphite whipser data from April 23th - https://phabricator.wikimedia.org/T192899#4153360 (10Gilles) a:05Gilles>03None [12:17:24] 10Operations, 10Analytics, 10Graphite: Restore Graphite whipser data 
from April 23th - https://phabricator.wikimedia.org/T192899#4153360 (10Gilles) [12:17:42] 10Operations, 10Analytics, 10Graphite: Restore Graphite whipser data from April 23th - https://phabricator.wikimedia.org/T192899#4153360 (10Gilles) [12:18:30] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1224 is OK: OK: synced at Tue 2018-04-24 12:18:25 UTC. [12:19:11] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [12:19:11] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [12:19:11] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [12:19:50] PROBLEM - Check systemd state on restbase-dev1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:20:50] RECOVERY - Check systemd state on restbase-dev1005 is OK: OK - running: The system is fully operational [12:21:16] (03CR) 10Ladsgroup: [C: 031] Set default for $wmgWikibaseSiteGroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428565 (https://phabricator.wikimedia.org/T188456) (owner: 10Hoo man) [12:25:50] PROBLEM - Check systemd state on restbase-dev1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:26:13] (03PS1) 10Jcrespo: mariadb: Increase db1110 load a bit more after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428617 [12:27:33] (03CR) 10Jcrespo: [C: 032] mariadb: Increase db1110 load a bit more after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428617 (owner: 10Jcrespo) [12:28:41] !log cleanup /home/elukey/zookeeper backup files taken before the 3.4.9 migration [12:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:48] where [12:28:49] (03Merged) 10jenkins-bot: mariadb: Increase db1110 load a bit more after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428617 (owner: 10Jcrespo) [12:28:52] ufffff [12:28:56] * elukey amends [12:30:17] (03CR) 10jenkins-bot: mariadb: Increase db1110 load a bit more after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428617 (owner: 10Jcrespo) [12:31:58] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Increase db1110 load (duration: 00m 58s) [12:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:32] (03PS1) 10Jcrespo: mariadb: Depool db1109 for reimage to stretch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428620 [12:35:14] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [12:37:44] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [12:37:44] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All 
endpoints are healthy [12:37:44] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [12:39:15] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:42:45] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:44:07] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1109 for reimage to stretch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428620 (owner: 10Jcrespo) [12:45:13] 10Operations, 10ops-eqiad: Broken memory/CPU on mw1275 - https://phabricator.wikimedia.org/T192902#4153443 (10MoritzMuehlenhoff) [12:45:20] (03Merged) 10jenkins-bot: mariadb: Depool db1109 for reimage to stretch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428620 (owner: 10Jcrespo) [12:46:04] PROBLEM - Check whether ferm is active by checking the default input chain on analytics1063 is CRITICAL: NRPE: Command check_ferm_active not defined [12:48:08] I am reimaging 106[34] --^ [12:48:13] it should be downtimed though [12:49:27] (03CR) 10jenkins-bot: mariadb: Depool db1109 for reimage to stretch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428620 (owner: 10Jcrespo) [12:50:14] RECOVERY - Check systemd state on restbase-dev1005 is OK: OK - running: The system is fully operational [12:51:05] (03PS1) 10Ladsgroup: Increase the timespan of rate limit in wikidata from 1m to 5m [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428623 (https://phabricator.wikimedia.org/T192690) [12:51:08] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1109 (duration: 00m 58s) [12:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:51] (03PS1) 10Ladsgroup: Clean up old config for logging autopatrol actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428624 (https://phabricator.wikimedia.org/T184485) [12:54:44] PROBLEM - 
Hadoop DataNode on analytics1063 is CRITICAL: NRPE: Command check_hadoop-hdfs-datanode not defined [12:55:27] reimage --^ [12:56:41] rename is running quite slow since last weekend [12:56:45] anything wrong with it? [12:56:53] ACKNOWLEDGEMENT - Host mw1275 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T192902 [12:57:00] (03PS1) 10Ladsgroup: Remove xx-uca-fa for Persian Wikis except Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428626 [12:57:03] (well... definition of slow and fast is just my observation) [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a European Mid-day SWAT(Max 8 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180424T1300). [13:00:04] hoo and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] Present [13:00:30] I can SWAT today [13:00:54] Cool :) [13:01:16] RECOVERY - Check whether ferm is active by checking the default input chain on analytics1063 is OK: OK ferm input default policy is set [13:01:17] new policy since yesterday https://wikitech.wikimedia.org/w/index.php?title=SWAT_deploys&type=revision&diff=1789212&oldid=1777024 [13:01:32] hoo: want to deploy your own change, or should I [13:01:47] RECOVERY - Hadoop DataNode on analytics1063 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [13:02:09] I added four patches to the SWAT :D [13:02:21] Amir1: :D deploying yourself? 
[13:02:35] zeljkof: Will do it myself :) [13:02:51] hoo: go ahead while I review Urbanecm's commit [13:02:56] ack [13:03:02] sure thing, it's four different patches so I should probably go last [13:03:08] Amir1: since you have 4 patches, you are last then :) [13:03:09] (03CR) 10Hoo man: [C: 032] Set default for $wmgWikibaseSiteGroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428565 (https://phabricator.wikimedia.org/T188456) (owner: 10Hoo man) [13:03:17] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4153488 (10faidon) @Cmjohnson I think @BBlack's question above was for you -- task description seems to point at a few of the steps on your side being still pending at least. [13:03:23] (03CR) 10Jcrespo: [C: 032] mediawiki: stop deleting autopatrol logs temporarily [puppet] - 10https://gerrit.wikimedia.org/r/428613 (owner: 10Ladsgroup) [13:03:56] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational [13:04:17] (03Merged) 10jenkins-bot: Set default for $wmgWikibaseSiteGroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428565 (https://phabricator.wikimedia.org/T188456) (owner: 10Hoo man) [13:04:34] Amir1, hoo: please note that there are a lot of these already in the logs: Undefined variable: wmgWikibaseSiteGroup in /srv/mediawiki/wmf-config/Wikibase.php on line [13:04:46] zeljkof: I know :S [13:05:16] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [13:05:35] It's a problem that should be resolved by hoo's patch (thank you hoo). 
This seems to be "high priority" (not UBN) so waiting for SWAT was ok [13:06:56] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Set default for $wmgWikibaseSiteGroup (T188456) (duration: 00m 59s) [13:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:03] T188456: Need to use the Wikidata Q for the WMRU site (Wikibase Client) - https://phabricator.wikimedia.org/T188456 [13:07:48] Seems to not do the trick [13:07:52] give me a moment, please [13:07:59] sure [13:09:16] (03CR) 10jenkins-bot: Set default for $wmgWikibaseSiteGroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428565 (https://phabricator.wikimedia.org/T188456) (owner: 10Hoo man) [13:09:16] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:09:56] PROBLEM - HHVM jobrunner on mw1296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:11:03] hoo: seems to me that the number of errors in the logs is slowly going down [13:11:04] ah, crap… SiteConfiguration doesn't allow us to set things to null explicitly [13:11:21] it was 220 a few minutes ago, it's 213 now [13:11:29] 10Operations, 10Traffic: Unconditional return(deliver) in vcl_hit - https://phabricator.wikimedia.org/T192368#4153519 (10ema) As of now, returning `deliver` instead of `fetch` is a valid mitigation for [[ https://github.com/varnishcache/varnish-cache/commit/c12a3e5e2bc978edafaccfe1c586e5d5dab01fcb| #1799 ]]. T... [13:12:46] RECOVERY - HHVM jobrunner on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [13:13:00] zeljkof: I guess we should continue and I'll try to deliver a patch in a moment [13:13:32] hoo: ok, I'll deploy Urbanecm's patch and you and Amir1 can coordinate for updates, ok? [13:13:40] Sounds good to me!
[13:13:44] ack [13:13:50] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428597 (https://phabricator.wikimedia.org/T192827) (owner: 10Urbanecm) [13:14:04] Urbanecm: merging and deploying 428597 [13:14:08] ack [13:15:06] (03Merged) 10jenkins-bot: New throttle rule for IndigenizeWikipedia event, clean obsolete rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428597 (https://phabricator.wikimedia.org/T192827) (owner: 10Urbanecm) [13:15:20] (03CR) 10jenkins-bot: New throttle rule for IndigenizeWikipedia event, clean obsolete rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428597 (https://phabricator.wikimedia.org/T192827) (owner: 10Urbanecm) [13:17:10] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:428597|New throttle rule for IndigenizeWikipedia event, clean obsolete rules (T192827)]] (duration: 00m 58s) [13:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:16] T192827: Lift IP cap for Wikipedia:Meetup/IndigenizeWikipedia - https://phabricator.wikimedia.org/T192827 [13:17:18] Urbanecm: patch deployed [13:17:24] thx [13:17:28] hoo, Amir1: swat is yours [13:18:25] hoo: I deploy one or two until you get to something [13:18:55] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425589 (https://phabricator.wikimedia.org/T190976) (owner: 10Ladsgroup) [13:19:21] (03PS1) 10Hoo man: Properly set the default for $wmgWikibaseSiteGroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428629 (https://phabricator.wikimedia.org/T188456) [13:19:28] Amir1: ^ CR appreciated :) [13:19:32] zeljkof, will you have a time for my second patch? 
:) [13:19:54] (after Amir1's ofc) [13:20:15] (03Merged) 10jenkins-bot: Add badge for good lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425589 (https://phabricator.wikimedia.org/T190976) (owner: 10Ladsgroup) [13:20:18] Urbanecm: sure, Amir1 do you want to deploy it after your patches? (I can do it too) [13:20:35] Urbanecm, Amir1: just ping me when it's my turn [13:20:48] I can do it [13:21:04] (03PS1) 10Urbanecm: Enable Mapframe for bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428630 (https://phabricator.wikimedia.org/T192895) [13:21:15] (03CR) 10jenkins-bot: Add badge for good lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425589 (https://phabricator.wikimedia.org/T190976) (owner: 10Ladsgroup) [13:23:39] Amir1, the patch was added to the Calendar. Please ping me as soon as you'll need me. [13:23:48] sure [13:26:38] (03PS1) 10Elukey: Allow the configuration of JMX and extra jvm opts for journal nodes [puppet/cdh] - 10https://gerrit.wikimedia.org/r/428632 (https://phabricator.wikimedia.org/T192905) [13:26:45] 10Operations, 10Patch-For-Review: Rack and setup ms-be1040-1043 - https://phabricator.wikimedia.org/T191896#4153569 (10fgiunchedi) Looks like 3 out of 4 hosts have sda or sdb as one of the HDDs, not SSDs. The remaining host has sda/sdb as SSDs and two additional mdadm raid arrays. @Cmjohnson anything in the s... 
[13:28:50] !log ladsgroup@tin Synchronized wmf-config/Wikibase-production.php: [[gerrit:425589|Add badge for good lists (T190976)]] (duration: 00m 55s) [13:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:56] T190976: Add badge for good lists to Wikidata - https://phabricator.wikimedia.org/T190976 [13:30:31] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11013/" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/428632 (https://phabricator.wikimedia.org/T192905) (owner: 10Elukey) [13:30:49] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428624 (https://phabricator.wikimedia.org/T184485) (owner: 10Ladsgroup) [13:31:07] (03PS2) 10Ladsgroup: Clean up old config for logging autopatrol actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428624 (https://phabricator.wikimedia.org/T184485) [13:32:33] (03CR) 10Filippo Giunchedi: "Nit inline, LGTM otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/428579 (https://phabricator.wikimedia.org/T192551) (owner: 10Jcrespo) [13:32:51] (03PS1) 10Marostegui: install_server: Allow install db1116-db1123 [puppet] - 10https://gerrit.wikimedia.org/r/428633 (https://phabricator.wikimedia.org/T191792) [13:33:21] (03CR) 10Ladsgroup: [C: 032] Clean up old config for logging autopatrol actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428624 (https://phabricator.wikimedia.org/T184485) (owner: 10Ladsgroup) [13:33:44] (03CR) 10Marostegui: [C: 032] install_server: Allow install db1116-db1123 [puppet] - 10https://gerrit.wikimedia.org/r/428633 (https://phabricator.wikimedia.org/T191792) (owner: 10Marostegui) [13:34:37] (03Merged) 10jenkins-bot: Clean up old config for logging autopatrol actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428624 (https://phabricator.wikimedia.org/T184485) (owner: 10Ladsgroup) [13:34:41] (03PS2) 10Ladsgroup: Increase the timespan of rate limit in wikidata from 1m to 5m 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/428623 (https://phabricator.wikimedia.org/T192690) [13:35:20] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4153597 (10Marostegui) @Cmjohnson I have enabled those hosts to get installed with the db.cfg recipe, so as soon as they start PXE booting they should get the correct installatio... [13:35:28] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [13:35:32] (03PS3) 10Filippo Giunchedi: base: alert on SMART health failure [puppet] - 10https://gerrit.wikimedia.org/r/427654 (https://phabricator.wikimedia.org/T86552) [13:35:43] (03CR) 10Filippo Giunchedi: base: alert on SMART health failure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/427654 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [13:35:47] 10Operations, 10Code-Stewardship-Reviews, 10Services (watching): zotero translation server: code stewardship request - https://phabricator.wikimedia.org/T187194#4153598 (10faidon) So, this task has been open for a couple of months now, with the underlying issues have been present for far longer than that. In... 
[13:36:25] (03PS1) 10Muehlenhoff: Remove obsolete mediawiki multimedia packages [puppet] - 10https://gerrit.wikimedia.org/r/428634 [13:37:39] (03PS3) 10Ladsgroup: Increase the timespan of rate limit in wikidata from 1m to 5m [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428623 (https://phabricator.wikimedia.org/T192690) [13:37:53] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:428624|Clean up old config for logging autopatrol actions (T184485)]] (duration: 00m 58s) [13:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:00] T184485: Stop logging autopatrol actions - https://phabricator.wikimedia.org/T184485 [13:39:17] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428623 (https://phabricator.wikimedia.org/T192690) (owner: 10Ladsgroup) [13:39:28] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:39:59] (03CR) 10Jcrespo: "I think that is not correct- the parameter has nothing to do with the class- it is a parameter of profile::base. The parameter should be e" [puppet] - 10https://gerrit.wikimedia.org/r/428579 (https://phabricator.wikimedia.org/T192551) (owner: 10Jcrespo) [13:40:01] (03CR) 10jenkins-bot: Clean up old config for logging autopatrol actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428624 (https://phabricator.wikimedia.org/T184485) (owner: 10Ladsgroup) [13:40:13] (03CR) 10Legoktm: [C: 04-1] "A quick guess, but I doubt this would have fixed the problem based on the MassMessage delivery size. 
The original commit should be reverte" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428623 (https://phabricator.wikimedia.org/T192690) (owner: 10Ladsgroup) [13:40:33] (03Merged) 10jenkins-bot: Increase the timespan of rate limit in wikidata from 1m to 5m [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428623 (https://phabricator.wikimedia.org/T192690) (owner: 10Ladsgroup) [13:42:41] (03CR) 10Elukey: [C: 032] Allow the configuration of JMX and extra jvm opts for journal nodes [puppet/cdh] - 10https://gerrit.wikimedia.org/r/428632 (https://phabricator.wikimedia.org/T192905) (owner: 10Elukey) [13:42:44] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1043 - https://phabricator.wikimedia.org/T192874#4153628 (10fgiunchedi) 05Open>03Invalid Host being setup in {T191896} [13:43:27] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:428623|Increase the timespan of rate limit in wikidata from 1m to 5m (T192690)]] (duration: 00m 58s) [13:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:33] T192690: Mass message broken on Wikidata after ratelimit workaround - https://phabricator.wikimedia.org/T192690 [13:44:22] (03PS1) 10Elukey: profile::hadoop::common: allow jvm opts for journalnodes [puppet] - 10https://gerrit.wikimedia.org/r/428636 (https://phabricator.wikimedia.org/T192905) [13:44:31] Amir1: Can I take over? [13:44:50] And a quick +1 for https://gerrit.wikimedia.org/r/428629 would be nice ;) [13:45:12] (03PS2) 10Elukey: profile::hadoop::common: allow jvm opts for journalnodes [puppet] - 10https://gerrit.wikimedia.org/r/428636 (https://phabricator.wikimedia.org/T192905) [13:45:44] (03CR) 10Ladsgroup: [C: 032] "I just your message, let's talk about it later and come to a conclusion." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/428623 (https://phabricator.wikimedia.org/T192690) (owner: 10Ladsgroup) [13:45:46] (03CR) 10jenkins-bot: Increase the timespan of rate limit in wikidata from 1m to 5m [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428623 (https://phabricator.wikimedia.org/T192690) (owner: 10Ladsgroup) [13:46:08] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1977 bytes in 0.103 second response time [13:46:37] hoo: please [13:46:57] Thanks :) [13:47:03] (03CR) 10Hoo man: [C: 032] Properly set the default for $wmgWikibaseSiteGroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428629 (https://phabricator.wikimedia.org/T188456) (owner: 10Hoo man) [13:47:33] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11014/" [puppet] - 10https://gerrit.wikimedia.org/r/428636 (https://phabricator.wikimedia.org/T192905) (owner: 10Elukey) [13:48:43] (03CR) 10Gilles: Remove obsolete mediawiki multimedia packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/428634 (owner: 10Muehlenhoff) [13:48:52] 10Operations, 10Analytics, 10Graphite: Restore Graphite whipser data from April 23th - https://phabricator.wikimedia.org/T192899#4153360 (10fgiunchedi) We're not backing up graphite's, though metrics are mirrored to codfw too so we can copy back from there. Which files you need? [13:49:18] 10Operations, 10Analytics, 10Graphite: Restore Graphite whipser data from April 23th - https://phabricator.wikimedia.org/T192899#4153650 (10Gilles) I nuked codfw as well... [13:49:37] 10Operations, 10Analytics, 10Graphite: Restore Graphite whipser data from April 23th - https://phabricator.wikimedia.org/T192899#4153651 (10Gilles) 05Open>03Invalid [13:49:50] gilles: doh! 
[13:50:23] it wasn't critical data, since we can't really compare it to the new stuff [13:50:42] I guess there are worse ways to learn that we rely on replication only and have no backups :) [13:51:07] hehe indeed, wiping everything out is one [13:51:51] but yeah too many files for bacula in graphite's case, though for limited subtrees of graphite's data we can backup [13:51:57] (03PS2) 10Hoo man: Properly set the default for $wmgWikibaseSiteGroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428629 (https://phabricator.wikimedia.org/T188456) [13:52:00] gotcha [13:52:08] PROBLEM - Check systemd state on snapshot1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:52:55] hoo: ping me when you're done [13:53:01] Amir1: Will do [13:53:03] Amir1, what's the status? [13:53:09] godog as part of datbase backups, we create tars of files for easier handling [13:53:16] Amir1: Just added https://gerrit.wikimedia.org/r/428290 to the SWAT on top [13:53:21] Urbanecm: two more deployments and we are done [13:53:27] it is automated and we have the code available [13:53:36] it also runs in parallel [13:53:45] ok. Will you have time for my patch as well? No problem with re-scheduling if it's a problem [13:53:55] (03PS1) 10Gehel: upgrade to upstream version 0.3.0 [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/428638 [13:54:04] Urbanecm: one of them will be probably time-consuming [13:54:19] (03CR) 10Hoo man: [C: 032] Properly set the default for $wmgWikibaseSiteGroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428629 (https://phabricator.wikimedia.org/T188456) (owner: 10Hoo man) [13:54:39] ok, so do you think I should reschedule? Or just wait and see? 
we have 6 minutes [13:55:33] (03Merged) 10jenkins-bot: Properly set the default for $wmgWikibaseSiteGroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428629 (https://phabricator.wikimedia.org/T188456) (owner: 10Hoo man) [13:55:59] Urbanecm: I'd say yes. Sorry [13:56:26] nothing happened [13:56:58] !log hoo@tin Synchronized wmf-config/: Properly set default for $wmgWikibaseSiteGroup (T188456) (duration: 01m 00s) [13:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:06] T188456: Need to use the Wikidata Q for the WMRU site (Wikibase Client) - https://phabricator.wikimedia.org/T188456 [13:57:29] (03CR) 10Filippo Giunchedi: [C: 031] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/428579 (https://phabricator.wikimedia.org/T192551) (owner: 10Jcrespo) [13:57:31] (forgot to git rebase) [13:58:02] (03PS1) 10RobH: Revert "icinga: temp remove Rob from paging" [puppet] - 10https://gerrit.wikimedia.org/r/428639 [13:58:08] (03PS2) 10RobH: Revert "icinga: temp remove Rob from paging" [puppet] - 10https://gerrit.wikimedia.org/r/428639 [13:58:09] PROBLEM - puppet last run on cp5008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:58:24] !log hoo@tin Synchronized wmf-config/: Properly set default for $wmgWikibaseSiteGroup (T188456) (duration: 00m 58s) [13:58:28] Now it works [13:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:32] verified already [13:58:39] jynus: on the filesystem IIRC? i.e. no streamed? 
I'm asking because graphite's disk utilization is already way over 50% to be able to store the tars [13:58:40] (03CR) 10Hoo man: [C: 032] Grant Meta-Wiki sysops the ability to edit global abusefilter rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428290 (https://phabricator.wikimedia.org/T192722) (owner: 10MarcoAurelio) [13:58:42] (03CR) 10RobH: [C: 032] Revert "icinga: temp remove Rob from paging" [puppet] - 10https://gerrit.wikimedia.org/r/428639 (owner: 10RobH) [13:58:48] PROBLEM - puppet last run on mc2031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:58:48] PROBLEM - puppet last run on db2050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:58:49] PROBLEM - puppet last run on db2085 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:59:09] PROBLEM - puppet last run on ms-be2032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:59:18] PROBLEM - puppet last run on mw2258 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:59:36] (03CR) 10jenkins-bot: Properly set the default for $wmgWikibaseSiteGroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428629 (https://phabricator.wikimedia.org/T188456) (owner: 10Hoo man) [13:59:55] (03Merged) 10jenkins-bot: Grant Meta-Wiki sysops the ability to edit global abusefilter rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428290 (https://phabricator.wikimedia.org/T192722) (owner: 10MarcoAurelio) [14:00:14] (03CR) 10Filippo Giunchedi: "See inline" (031 comment) [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/428638 (owner: 10Gehel) [14:00:47] 10Operations, 10Ops-Access-Requests: Access to Google Search Console, Tag Manager, and Analytics for Go Fish Digital - https://phabricator.wikimedia.org/T192893#4153156 (10MoritzMuehlenhoff) The management of the Google Search Console is unfortunately quite limited. We can delegate access to invidual Google ac... [14:01:08] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1973 bytes in 0.124 second response time [14:01:09] PROBLEM - puppet last run on mc2035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:01:13] (03CR) 10Jcrespo: "note that the most similar hiera key, $notifications_enabled, follows the same pattern." [puppet] - 10https://gerrit.wikimedia.org/r/428579 (https://phabricator.wikimedia.org/T192551) (owner: 10Jcrespo) [14:01:27] !log hoo@tin Synchronized wmf-config/abusefilter.php: Grant Meta-Wiki sysops the ability to edit global abusefilter rules (T192722) (duration: 00m 59s) [14:01:28] PROBLEM - puppet last run on sarin is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:33] T192722: Add abuse filter modification access for global filters to sysops on metawiki - https://phabricator.wikimedia.org/T192722 [14:01:43] (03PS1) 10Elukey: profile::hadoop::worker: deploy the journalnode prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/428640 (https://phabricator.wikimedia.org/T192905) [14:02:08] PROBLEM - puppet last run on labtestvirt2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:02:08] PROBLEM - puppet last run on mw2281 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:02:08] Looks good [14:02:09] PROBLEM - puppet last run on wtp2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:02:10] * hoo done [14:02:30] !log EU SWAT is done [14:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:58] PROBLEM - puppet last run on elastic2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:03:08] PROBLEM - puppet last run on mw2147 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:03:08] PROBLEM - puppet last run on dbstore2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:03:09] PROBLEM - puppet last run on conf2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:03:16] (03PS3) 10Jcrespo: base: Disable atop daemon everywhere [puppet] - 10https://gerrit.wikimedia.org/r/428579 (https://phabricator.wikimedia.org/T192551) [14:03:26] akosiaris: did you just restart the ganeti with puppetdb2001? 
[14:03:34] that would explain the failures [14:03:37] (03PS2) 10Gehel: upgrade to upstream version 0.3.0 [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/428638 (https://phabricator.wikimedia.org/T192768) [14:03:39] (03PS4) 10Jcrespo: base: Disable atop daemon everywhere [puppet] - 10https://gerrit.wikimedia.org/r/428579 (https://phabricator.wikimedia.org/T192551) [14:04:29] volans: yup [14:04:39] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11015/" [puppet] - 10https://gerrit.wikimedia.org/r/428640 (https://phabricator.wikimedia.org/T192905) (owner: 10Elukey) [14:04:39] that's exactly it [14:04:43] ack! a re-run of puppet works, so all seems good to go [14:04:50] (03CR) 10Filippo Giunchedi: [C: 031] upgrade to upstream version 0.3.0 [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/428638 (https://phabricator.wikimedia.org/T192768) (owner: 10Gehel) [14:04:53] sorry, should have logged the puppetdb reboot explicitly [14:05:09] (03CR) 10Gehel: [C: 032] upgrade to upstream version 0.3.0 [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/428638 (https://phabricator.wikimedia.org/T192768) (owner: 10Gehel) [14:05:24] (03CR) 10jenkins-bot: Grant Meta-Wiki sysops the ability to edit global abusefilter rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428290 (https://phabricator.wikimedia.org/T192722) (owner: 10MarcoAurelio) [14:06:06] no prob, I saw the SAL from this morning of the slow reboot ;) [14:06:28] RECOVERY - puppet last run on sarin is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:06:38] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [14:08:04] (03CR) 10Mark Bergsma: [C: 032] Cleanup module for consistency (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/424009 (owner: 10Mark Bergsma) [14:08:37] (03Merged) 10jenkins-bot: Cleanup module for consistency [debs/pybal] - 
10https://gerrit.wikimedia.org/r/424009 (owner: 10Mark Bergsma) [14:16:31] (03PS1) 10Ottomata: role::kafka::main - only include MirrorMaker in prod [puppet] - 10https://gerrit.wikimedia.org/r/428642 (https://phabricator.wikimedia.org/T192831) [14:17:48] (03PS2) 10Ottomata: role::kafka::main - only include MirrorMaker in prod [puppet] - 10https://gerrit.wikimedia.org/r/428642 (https://phabricator.wikimedia.org/T192831) [14:17:51] (03CR) 10Ottomata: [V: 032 C: 032] role::kafka::main - only include MirrorMaker in prod [puppet] - 10https://gerrit.wikimedia.org/r/428642 (https://phabricator.wikimedia.org/T192831) (owner: 10Ottomata) [14:24:54] (03PS4) 10Mark Bergsma: Add test cases for implemented event 25 and fix OPENSENT [debs/pybal] - 10https://gerrit.wikimedia.org/r/424011 [14:26:08] RECOVERY - puppet last run on mc2035 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [14:27:08] RECOVERY - puppet last run on labtestvirt2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:27:09] RECOVERY - puppet last run on wtp2002 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [14:27:58] RECOVERY - puppet last run on elastic2017 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [14:28:00] hmm, godog, i was about to set up promtheus jmx exporter stuff in deployment-prep [14:28:09] but i'm remembring that maybe labsprojects don't have puppetdb? [14:28:13] RECOVERY - puppet last run on conf2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:28:14] and exported resources? is this true? 
[14:28:37] 10Operations, 10hardware-requests: request to assign WMF3565 as terbium equivalent - https://phabricator.wikimedia.org/T192185#4153789 (10Dzahn) a:03Dzahn [14:28:44] RECOVERY - puppet last run on db2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:28:44] RECOVERY - puppet last run on mc2031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:28:48] 10Operations, 10hardware-requests: request to assign WMF3565 as terbium equivalent - https://phabricator.wikimedia.org/T192185#4131143 (10Dzahn) p:05High>03Normal [14:28:53] RECOVERY - puppet last run on db2085 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:29:13] RECOVERY - puppet last run on ms-be2032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:29:13] RECOVERY - puppet last run on mw2258 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:29:29] ottomata: last I checked that was true yeah [14:29:31] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to add Matthias Geisler to the ldap/wmde group - https://phabricator.wikimedia.org/T191523#4153812 (10Dzahn) 05stalled>03Open [14:29:55] ok, won't even try then :) [14:29:59] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to add Matthias Geisler to the ldap/wmde group - https://phabricator.wikimedia.org/T191523#4108485 (10Dzahn) a:05Matthias_Geisler_WMDE>03RStallman-legalteam [14:30:13] (03PS2) 10Jcrespo: Rename 'm5' section to 'wikitech' and add explicit hostnames. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/427930 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott) [14:30:24] (03PS1) 10Ottomata: Set default 'cluster' for profile::kafka::broker::monitoring [puppet] - 10https://gerrit.wikimedia.org/r/428647 (https://phabricator.wikimedia.org/T192831) [14:31:27] (03CR) 10Jcrespo: "I think this is only missing the not read only/read only parts for eqiad and codfw." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427930 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott) [14:31:44] (03CR) 10Marostegui: [C: 031] base: Disable atop daemon everywhere [puppet] - 10https://gerrit.wikimedia.org/r/428579 (https://phabricator.wikimedia.org/T192551) (owner: 10Jcrespo) [14:31:58] (03CR) 10Jcrespo: "Manuel: Give it a second look, please, if you can." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427930 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott) [14:32:13] RECOVERY - puppet last run on mw2281 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:33:03] (03CR) 10Mark Bergsma: [C: 032] Add test cases for implemented event 25 and fix OPENSENT [debs/pybal] - 10https://gerrit.wikimedia.org/r/424011 (owner: 10Mark Bergsma) [14:33:36] (03Merged) 10jenkins-bot: Add test cases for implemented event 25 and fix OPENSENT [debs/pybal] - 10https://gerrit.wikimedia.org/r/424011 (owner: 10Mark Bergsma) [14:33:46] (03CR) 10Ottomata: [C: 032] Set default 'cluster' for profile::kafka::broker::monitoring [puppet] - 10https://gerrit.wikimedia.org/r/428647 (https://phabricator.wikimedia.org/T192831) (owner: 10Ottomata) [14:34:03] !log sbisson@tin Started deploy [kartotherian/deploy@86da82d]: Deploy latest kartotherian with updated fallbacks and support lang=local [14:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:21] (03CR) 10Elukey: [C: 032] profile::hadoop::common: allow jvm opts for journalnodes [puppet] - 
10https://gerrit.wikimedia.org/r/428636 (https://phabricator.wikimedia.org/T192905) (owner: 10Elukey) [14:34:26] (03PS3) 10Elukey: profile::hadoop::common: allow jvm opts for journalnodes [puppet] - 10https://gerrit.wikimedia.org/r/428636 (https://phabricator.wikimedia.org/T192905) [14:35:54] (03CR) 10Marostegui: [C: 04-1] "Just a couple of quick comments" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427930 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott) [14:36:03] (03PS1) 10Ottomata: Set defaults for replica_maxlag broker monitoring thresholds [puppet] - 10https://gerrit.wikimedia.org/r/428648 (https://phabricator.wikimedia.org/T192831) [14:36:17] (03PS2) 10Elukey: profile::hadoop::worker: deploy the journalnode prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/428640 (https://phabricator.wikimedia.org/T192905) [14:36:53] RECOVERY - puppet last run on cp5008 is OK: OK: Puppet is currently enabled, last run 10 minutes ago with 0 failures [14:37:21] (03PS2) 10Ottomata: Set defaults for replica_maxlag broker monitoring thresholds [puppet] - 10https://gerrit.wikimedia.org/r/428648 (https://phabricator.wikimedia.org/T192831) [14:37:27] (03CR) 10Ottomata: [V: 032 C: 032] Set defaults for replica_maxlag broker monitoring thresholds [puppet] - 10https://gerrit.wikimedia.org/r/428648 (https://phabricator.wikimedia.org/T192831) (owner: 10Ottomata) [14:37:32] (03CR) 10Elukey: [C: 032] profile::hadoop::worker: deploy the journalnode prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/428640 (https://phabricator.wikimedia.org/T192905) (owner: 10Elukey) [14:37:38] (03PS3) 10Elukey: profile::hadoop::worker: deploy the journalnode prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/428640 (https://phabricator.wikimedia.org/T192905) [14:37:41] (03CR) 10Elukey: [V: 032 C: 032] profile::hadoop::worker: deploy the journalnode prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/428640 
(https://phabricator.wikimedia.org/T192905) (owner: 10Elukey) [14:38:00] (03CR) 10Muehlenhoff: Remove obsolete mediawiki multimedia packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/428634 (owner: 10Muehlenhoff) [14:38:11] (03PS2) 10Muehlenhoff: Remove obsolete mediawiki multimedia packages [puppet] - 10https://gerrit.wikimedia.org/r/428634 [14:38:33] RECOVERY - puppet last run on dbstore2001 is OK: OK: Puppet is currently enabled, last run 11 minutes ago with 0 failures [14:40:23] RECOVERY - puppet last run on mw2147 is OK: OK: Puppet is currently enabled, last run 12 minutes ago with 0 failures [14:40:32] !log sbisson@tin Finished deploy [kartotherian/deploy@86da82d]: Deploy latest kartotherian with updated fallbacks and support lang=local (duration: 06m 29s) [14:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:28] !log restart hadoop hdfs journalnode on analytics1028 to pick up jmx settings [14:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:46] (03PS1) 10Ottomata: Remove unneeded defaults from role/kafka/main.yaml hiera [puppet] - 10https://gerrit.wikimedia.org/r/428654 (https://phabricator.wikimedia.org/T192831) [14:49:33] (03CR) 10Ottomata: [C: 032] Remove unneeded defaults from role/kafka/main.yaml hiera [puppet] - 10https://gerrit.wikimedia.org/r/428654 (https://phabricator.wikimedia.org/T192831) (owner: 10Ottomata) [14:51:43] (03PS1) 10Muehlenhoff: Switch scap proxy in A7 to mw1268 [puppet] - 10https://gerrit.wikimedia.org/r/428655 [14:52:16] (03PS1) 10Ottomata: Use profile::kafka::broker for main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/428656 (https://phabricator.wikimedia.org/T192831) [14:52:40] (03CR) 10jerkins-bot: [V: 04-1] Use profile::kafka::broker for main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/428656 (https://phabricator.wikimedia.org/T192831) (owner: 10Ottomata) [14:54:05] (03PS1) 10Gehel: update symlinks with correct version 
number [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/428657 [14:54:54] (03CR) 10Filippo Giunchedi: [C: 031] update symlinks with correct version number [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/428657 (owner: 10Gehel) [14:55:19] (03CR) 10Gehel: [C: 032] update symlinks with correct version number [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/428657 (owner: 10Gehel) [14:55:33] RECOVERY - MD RAID on ms-be1043 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [14:58:50] (03PS1) 10Ottomata: Add kafka_cluster label to Prometheus Kafka metrics [puppet] - 10https://gerrit.wikimedia.org/r/428658 (https://phabricator.wikimedia.org/T192831) [14:59:48] (03CR) 10Ottomata: [C: 032] Add kafka_cluster label to Prometheus Kafka metrics [puppet] - 10https://gerrit.wikimedia.org/r/428658 (https://phabricator.wikimedia.org/T192831) (owner: 10Ottomata) [15:03:25] 10Operations, 10Ops-Access-Requests: Access to Google Search Console, Tag Manager, and Analytics for Go Fish Digital - https://phabricator.wikimedia.org/T192893#4153917 (10Deskana) >>! In T192893#4153694, @MoritzMuehlenhoff wrote: > The management of the Google Search Console is unfortunately quite limited. We... 
[15:03:52] (03PS1) 10Imarlier: coal: Point systemd and uwsgi config to scap-deployed version [puppet] - 10https://gerrit.wikimedia.org/r/428659 (https://phabricator.wikimedia.org/T191994) [15:04:19] (03CR) 10jerkins-bot: [V: 04-1] coal: Point systemd and uwsgi config to scap-deployed version [puppet] - 10https://gerrit.wikimedia.org/r/428659 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [15:07:13] (03PS2) 10Imarlier: coal: Point systemd and uwsgi config to scap-deployed version [puppet] - 10https://gerrit.wikimedia.org/r/428659 (https://phabricator.wikimedia.org/T191994) [15:11:35] 10Operations, 10Ops-Access-Requests: Access to Google Search Console for Go Fish Digital - https://phabricator.wikimedia.org/T192893#4153967 (10MoritzMuehlenhoff) [15:12:29] (03CR) 10Gilles: [C: 031] Remove obsolete mediawiki multimedia packages [puppet] - 10https://gerrit.wikimedia.org/r/428634 (owner: 10Muehlenhoff) [15:17:26] 10Operations, 10Patch-For-Review: Rack and setup ms-be1040-1043 - https://phabricator.wikimedia.org/T191896#4153996 (10fgiunchedi) @Cmjohnson confirmed raid config is the same on all of those, I rebooted the hosts showing the incorrect order and indeed upon reboot the order is as expected: ``` ===== NODE GROU... [15:18:31] (03CR) 10Gilles: [C: 031] coal: Point systemd and uwsgi config to scap-deployed version [puppet] - 10https://gerrit.wikimedia.org/r/428659 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [15:18:58] (03CR) 10Anomie: "The related core patch also removes 'edituserjs' from sysop. Should that also be accounted for here?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/421123 (https://phabricator.wikimedia.org/T190015) (owner: 10Gergő Tisza) [15:22:42] (03PS1) 10Filippo Giunchedi: Add ms-be104[0123] to swift::storage [puppet] - 10https://gerrit.wikimedia.org/r/428661 (https://phabricator.wikimedia.org/T191896) [15:23:23] (03PS1) 10Gehel: updated build documentation [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/428662 [15:25:21] (03CR) 10Imarlier: "Puppet compiler run looks good to me..." [puppet] - 10https://gerrit.wikimedia.org/r/428659 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [15:25:35] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - mathoid_10042: Servers kubernetes1001.eqiad.wmnet, kubernetes1002.eqiad.wmnet are marked down but pooled [15:25:36] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - CRITICAL - mathoid_10042: Servers kubernetes1002.eqiad.wmnet, kubernetes1003.eqiad.wmnet are marked down but pooled [15:25:53] (03CR) 10Filippo Giunchedi: [C: 032] "PCC https://puppet-compiler.wmflabs.org/compiler02/11018/" [puppet] - 10https://gerrit.wikimedia.org/r/428661 (https://phabricator.wikimedia.org/T191896) (owner: 10Filippo Giunchedi) [15:26:27] Anyone have a couple of minutes to check over and merge a puppet change? 
https://gerrit.wikimedia.org/r/#/c/428659/ [15:27:36] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [15:27:36] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [15:28:38] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: 1.137e+05 ge 1.5e+04 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:28:48] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: 1.105e+05 ge 1.5e+04 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:28:58] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: 2.199e+05 ge 1.5e+04 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:29:08] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: 2.287e+05 ge 1.5e+04 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:31:54] akosiaris: FYI, trouble with mathoid? ^ [15:34:14] https://grafana.wikimedia.org/dashboard/db/pybal?panelId=8&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-server=lvs1003&var-service=mathoid_10042&from=1524582989109&to=1524584005208 [15:36:05] "kubelet operational latencies [15:36:26] criticals for kubernetes100[1-4] [15:38:44] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: (C)1.5e+04 ge (W)1e+04 ge 4300 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:38:53] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: (C)1.5e+04 ge (W)1e+04 ge 6131 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:39:04] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: (C)1.5e+04 ge (W)1e+04 ge 4869 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:39:14] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: (C)1.5e+04 ge (W)1e+04 ge 4829 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:40:11] the number of pods for mathoid has been increased from 4 to 16 earlier this month, so maybe this time that wasn't the issue? [15:41:50] "operations eqiad by node" did go up a bit https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1&from=1524582517180&to=1524584469817 [15:44:50] PROBLEM - Nginx local proxy to apache on mw2235 is CRITICAL: connect to address 10.192.0.61 and port 443: Connection refused [15:44:50] PROBLEM - Check systemd state on mw2235 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:46:30] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2235 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:46:49] ehm.. that is me reinstalling it [15:46:53] but it should have been downtimed [15:48:11] PROBLEM - Check whether ferm is active by checking the default input chain on mw2235 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:48:11] PROBLEM - configured eth on mw2235 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:48:53] it's during the first puppet run [15:49:51] PROBLEM - dhclient process on mw2235 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:49:51] PROBLEM - DPKG on mw2235 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[15:50:08] (03PS4) 10Jcrespo: Revert "mariadb: Depool db1110 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428598 [15:50:30] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1110 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428598 (owner: 10Jcrespo) [15:51:12] (03PS1) 10Elukey: role::analytics_cluter::hadoop::worker: enable jmx metrics for journalnodes [puppet] - 10https://gerrit.wikimedia.org/r/428669 (https://phabricator.wikimedia.org/T192905) [15:51:40] PROBLEM - mediawiki-installation DSH group on mw2235 is CRITICAL: Host mw2235 is not in mediawiki-installation dsh group [15:51:40] PROBLEM - Disk space on mw2235 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:51:55] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1110 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428598 (owner: 10Jcrespo) [15:52:11] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1110 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428598 (owner: 10Jcrespo) [15:53:20] PROBLEM - HHVM processes on mw2235 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:53:21] PROBLEM - nutcracker port on mw2235 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:53:31] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4154242 (10Cmjohnson) @Vgutierrez Sorry about that, it was set but I had an extra . in the subnet. Anyway, that is fixed. Also, I am not sure which image you want to instal... 
[15:53:55] mutante: there is a race condition where, depending on when puppet runs and when puppet runs on the icinga host, an alert can be raised [15:54:08] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1110 with full weight (duration: 00m 58s) [15:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:35] the best way to avoid that is to set up notifications_enable:0 with hiera keys for new hosts [15:54:39] jynus: makes sense yep, thanks. that's like what i'm seeing. most of the time it works though [15:54:59] yes, it is a very narrow window [15:55:00] PROBLEM - HHVM rendering on mw2235 is CRITICAL: connect to address 10.192.0.61 and port 80: Connection refused [15:55:01] PROBLEM - nutcracker process on mw2235 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:55:03] while puppet is still running [15:56:50] PROBLEM - puppet last run on mw2235 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:59:13] !log reimage restbase1010 after ssd swap - T189822 [15:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:19] T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822 [16:00:04] godog, moritzm, and _joe_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180424T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:11] PROBLEM - Apache HTTP on mw2235 is CRITICAL: connect to address 10.192.0.61 and port 80: Connection refused [16:00:20] PROBLEM - MD RAID on mw2235 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[16:01:55] (03CR) 10Elukey: [C: 032] role::analytics_cluter::hadoop::worker: enable jmx metrics for journalnodes [puppet] - 10https://gerrit.wikimedia.org/r/428669 (https://phabricator.wikimedia.org/T192905) (owner: 10Elukey) [16:02:21] PROBLEM - Check whether ferm is active by checking the default input chain on mw2236 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:02:21] PROBLEM - configured eth on mw2236 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:04:01] PROBLEM - DPKG on mw2236 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:04:01] PROBLEM - dhclient process on mw2236 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:05:41] PROBLEM - mediawiki-installation DSH group on mw2236 is CRITICAL: Host mw2236 is not in mediawiki-installation dsh group [16:05:41] PROBLEM - Disk space on mw2236 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:05:56] 10Operations, 10Ops-Access-Requests: Access to Google Search Console for Go Fish Digital - https://phabricator.wikimedia.org/T192893#4154323 (10Deskana) [16:06:18] (03PS1) 10Bstorm: Revert "wiki replicas: Depool labsdb1011 for MCR table additions" [puppet] - 10https://gerrit.wikimedia.org/r/428672 [16:06:38] (03PS2) 10Marostegui: Revert "wiki replicas: Depool labsdb1011 for MCR table additions" [puppet] - 10https://gerrit.wikimedia.org/r/428672 (owner: 10Bstorm) [16:07:13] (03PS1) 10Elukey: Remove duplicate declaration of journalnode opts [puppet/cdh] - 10https://gerrit.wikimedia.org/r/428673 [16:07:30] PROBLEM - HHVM processes on mw2236 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:07:30] PROBLEM - nutcracker port on mw2236 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[16:07:31] (03CR) 10Elukey: [V: 032 C: 032] Remove duplicate declaration of journalnode opts [puppet/cdh] - 10https://gerrit.wikimedia.org/r/428673 (owner: 10Elukey) [16:07:33] (03CR) 10Marostegui: [C: 032] Revert "wiki replicas: Depool labsdb1011 for MCR table additions" [puppet] - 10https://gerrit.wikimedia.org/r/428672 (owner: 10Bstorm) [16:07:40] (03PS1) 10Andrew Bogott: realm.pp: allow hosts in .labtest [puppet] - 10https://gerrit.wikimedia.org/r/428674 [16:08:03] !log Reload haproxy on dbproxy1010 to repool labsdb1011 - https://phabricator.wikimedia.org/T184446 [16:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:47] (03PS1) 10Elukey: Update the cdh module to its latest version [puppet] - 10https://gerrit.wikimedia.org/r/428676 [16:09:06] (03CR) 10Elukey: [V: 032 C: 032] Update the cdh module to its latest version [puppet] - 10https://gerrit.wikimedia.org/r/428676 (owner: 10Elukey) [16:09:10] PROBLEM - HHVM rendering on mw2236 is CRITICAL: connect to address 10.192.0.62 and port 80: Connection refused [16:09:11] PROBLEM - nutcracker process on mw2236 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:09:21] RECOVERY - Apache HTTP on mw2235 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.073 second response time [16:10:12] !log Added views for new MCR tables on labsdb1011 (slots, slot_roles, content and content_models) [16:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:51] PROBLEM - puppet last run on mw2236 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[16:11:26] !log restart hadoop-hdfs-journalnode on analytics1028 to pick up prometheus monitoring [16:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:32] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#4154368 (10Krinkle) [16:11:35] (03CR) 10Brion VIBBER: [C: 031] "Looks good. Libraries will be pulled in via ffmpeg, so don't need to be mentioned." [puppet] - 10https://gerrit.wikimedia.org/r/428634 (owner: 10Muehlenhoff) [16:14:20] PROBLEM - Apache HTTP on mw2236 is CRITICAL: connect to address 10.192.0.62 and port 80: Connection refused [16:14:20] PROBLEM - MD RAID on mw2236 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:14:26] (03CR) 10Herron: [C: 04-1] "Couple of comments about the sudoers rules, otherwise looks good to me" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/428659 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [16:16:01] PROBLEM - Check size of conntrack table on mw2236 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[16:16:11] (03PS2) 10Andrew Bogott: realm.pp: allow hosts in .labtest [puppet] - 10https://gerrit.wikimedia.org/r/428674 [16:16:20] RECOVERY - Apache HTTP on mw2236 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.073 second response time [16:16:44] 10Operations, 10Performance-Team, 10Patch-For-Review: Move coal from graphite#001 nodes to webperf#001 - https://phabricator.wikimedia.org/T159354#4154389 (10Krinkle) p:05Low>03Normal [16:17:08] (03CR) 10Andrew Bogott: [C: 032] realm.pp: allow hosts in .labtest [puppet] - 10https://gerrit.wikimedia.org/r/428674 (owner: 10Andrew Bogott) [16:17:20] (03CR) 10Arturo Borrero Gonzalez: [C: 031] realm.pp: allow hosts in .labtest [puppet] - 10https://gerrit.wikimedia.org/r/428674 (owner: 10Andrew Bogott) [16:17:50] PROBLEM - Nginx local proxy to apache on mw2236 is CRITICAL: connect to address 10.192.0.62 and port 443: Connection refused [16:17:50] PROBLEM - Check systemd state on mw2236 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:19:25] (03CR) 10Imarlier: coal: Point systemd and uwsgi config to scap-deployed version (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/428659 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [16:19:30] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2236 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:19:34] 10Operations, 10Performance-Team, 10Patch-For-Review: Move coal from graphite#001 nodes to webperf#001 - https://phabricator.wikimedia.org/T159354#4154398 (10Krinkle) [16:19:48] (03PS3) 10Imarlier: coal: Point systemd and uwsgi config to scap-deployed version [puppet] - 10https://gerrit.wikimedia.org/r/428659 (https://phabricator.wikimedia.org/T191994) [16:20:04] (03CR) 10Brion VIBBER: [C: 031] "Looks good; as noted inline the fonts include can probably go too." 
[puppet] - 10https://gerrit.wikimedia.org/r/428300 (owner: 10Muehlenhoff) [16:24:21] PROBLEM - HHVM rendering on mw2236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:00] RECOVERY - Disk space on mw2235 is OK: DISK OK [16:25:01] RECOVERY - HHVM processes on mw2235 is OK: PROCS OK: 6 processes with command name hhvm [16:25:20] RECOVERY - dhclient process on mw2235 is OK: PROCS OK: 0 processes with command name dhclient [16:25:20] RECOVERY - DPKG on mw2235 is OK: All packages OK [16:25:32] RECOVERY - Check whether ferm is active by checking the default input chain on mw2235 is OK: OK ferm input default policy is set [16:25:32] RECOVERY - configured eth on mw2235 is OK: OK - interfaces up [16:25:32] RECOVERY - MD RAID on mw2235 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [16:26:31] PROBLEM - Apache HTTP on mw2236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:44] (03PS3) 10Andrew Bogott: Rename 'm5' section to 'wikitech' and add explicit hostnames. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427930 (https://phabricator.wikimedia.org/T189542) [16:28:59] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: 3.228e+04 ge 1.5e+04 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:29:08] Hmm. So… if a wg* setting is set to an explicit value in InitialiseSettings it's read and set, but if it's set to null do we explicitly set it to null, or do we not set it? Might explain T192855. [16:29:08] T192855: Strange problem on ar.wiki - https://phabricator.wikimedia.org/T192855 [16:30:06] (03CR) 10Marostegui: "What should we do with the read only parts missing Jaime was mentioning in his previous comments?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/427930 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott) [16:30:29] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - mathoid_10042: Servers kubernetes1001.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1003.eqiad.wmnet are marked down but pooled [16:30:34] (03PS5) 10Muehlenhoff: mediawiki::packages::fonts: Consistently use require_package [puppet] - 10https://gerrit.wikimedia.org/r/420670 [16:30:39] PROBLEM - Mathoid LVS eqiad on mathoid.svc.eqiad.wmnet is CRITICAL: /{format}/ (mass-energy equivalence (texvcinfo)) timed out before a response was received: /{format}/ (Invaid command (texvcinfo)) timed out before a response was received: / (spec from root) timed out before a response was received: / (mass-energy equivalence (json)) timed out before a response was received [16:30:41] !log restart hadoop hdfs journalnode on analytics1035/52 to pick up prometheus jmx settings [16:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:55] (03CR) 10Filippo Giunchedi: updated build documentation (031 comment) [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/428662 (owner: 10Gehel) [16:31:00] PROBLEM - puppet last run on mw2236 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:31:09] PROBLEM - Disk space on mw2236 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:31:09] PROBLEM - Check systemd state on mw2236 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[16:31:29] PROBLEM - nutcracker process on mw2236 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nutcracker), command name nutcracker [16:31:30] RECOVERY - MD RAID on mw2236 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [16:31:39] RECOVERY - Check whether ferm is active by checking the default input chain on mw2236 is OK: OK ferm input default policy is set [16:31:40] RECOVERY - configured eth on mw2236 is OK: OK - interfaces up [16:31:40] RECOVERY - HHVM processes on mw2236 is OK: PROCS OK: 6 processes with command name hhvm [16:31:49] PROBLEM - nutcracker port on mw2236 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [16:31:50] RECOVERY - Check size of conntrack table on mw2236 is OK: OK: nf_conntrack is 0 % full [16:31:59] RECOVERY - Disk space on mw2236 is OK: DISK OK [16:32:30] RECOVERY - nutcracker process on mw2235 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [16:32:35] (03PS2) 10Gehel: updated build documentation [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/428662 [16:32:40] PROBLEM - Mathoid LVS eqiad on mathoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) timed out before a response was received [16:32:48] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1098:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428680 [16:32:50] (03CR) 10Gehel: updated build documentation (031 comment) [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/428662 (owner: 10Gehel) [16:32:58] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1098:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428680 [16:32:59] RECOVERY - nutcracker port on mw2235 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [16:33:02] jouncebot: next [16:33:02] In 0 hour(s) and 26 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180424T1700) [16:33:39] 
RECOVERY - DPKG on mw2236 is OK: All packages OK [16:33:39] RECOVERY - Mathoid LVS eqiad on mathoid.svc.eqiad.wmnet is OK: All endpoints are healthy [16:33:40] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: 3.566e+05 ge 1.5e+04 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:33:49] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: 4.613e+05 ge 1.5e+04 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:34:10] RECOVERY - Nginx local proxy to apache on mw2235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.218 second response time [16:34:29] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [16:35:30] RECOVERY - HHVM rendering on mw2235 is OK: HTTP OK: HTTP/1.1 200 OK - 74372 bytes in 0.324 second response time [16:35:30] (03PS1) 10Ottomata: Use kafka_cluster label instead of cluster for broker alerts [puppet] - 10https://gerrit.wikimedia.org/r/428681 (https://phabricator.wikimedia.org/T192831) [16:35:41] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1098:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428680 (owner: 10Marostegui) [16:36:49] RECOVERY - puppet last run on mw2235 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:37:03] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1098:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428680 (owner: 10Marostegui) [16:37:15] (03CR) 10Andrew Bogott: "To quote my earlier self, "I don't know how to mark a db server as read-only. 
Also, labtestwikitech is in codfw, so I'm not sure clear on " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427930 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott) [16:37:34] (03CR) 10Ottomata: [C: 032] Use kafka_cluster label instead of cluster for broker alerts [puppet] - 10https://gerrit.wikimedia.org/r/428681 (https://phabricator.wikimedia.org/T192831) (owner: 10Ottomata) [16:37:41] (03PS1) 10Gehel: fixed typo in distribution name [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/428682 [16:38:09] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: (C)1.5e+04 ge (W)1e+04 ge 3669 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:38:24] (03CR) 10Filippo Giunchedi: [C: 031] fixed typo in distribution name [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/428682 (owner: 10Gehel) [16:38:29] RECOVERY - nutcracker process on mw2236 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [16:38:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1098:3316 after alter table (duration: 00m 58s) [16:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:40] RECOVERY - Apache HTTP on mw2236 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 9.791 second response time [16:38:46] (03PS1) 10Muehlenhoff: Switch scap proxy in B6 to mw1285 [puppet] - 10https://gerrit.wikimedia.org/r/428683 [16:38:48] (03CR) 10Gehel: [C: 032] updated build documentation [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/428662 (owner: 10Gehel) [16:38:50] RECOVERY - nutcracker port on mw2236 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [16:38:54] (03CR) 10Gehel: [C: 032] fixed typo in distribution name [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/428682 (owner: 10Gehel) [16:38:59] RECOVERY - dhclient process on mw2236 is OK: PROCS OK: 0 processes with 
command name dhclient [16:39:09] RECOVERY - Check systemd state on mw2236 is OK: OK - running: The system is fully operational [16:39:09] RECOVERY - Nginx local proxy to apache on mw2236 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 1.681 second response time [16:39:40] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: (C)1.5e+04 ge (W)1e+04 ge 4860 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:39:59] PROBLEM - cassandra-c CQL 10.64.0.116:9042 on restbase1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:40:26] (03PS1) 10Marostegui: db-eqiad.php: Depool db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428684 (https://phabricator.wikimedia.org/T190148) [16:40:29] RECOVERY - HHVM rendering on mw2236 is OK: HTTP OK: HTTP/1.1 200 OK - 74372 bytes in 0.420 second response time [16:40:31] !log mobrovac@tin Started deploy [restbase/deploy@fbce520]: Set the delete probability to 100% - T192689 [16:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:37] T192689: Unchecked storage growth(?) 
- https://phabricator.wikimedia.org/T192689 [16:41:54] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1098:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428680 (owner: 10Marostegui) [16:42:50] (03CR) 10Marostegui: "> To quote my earlier self, "I don't know how to mark a db server as" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427930 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott) [16:43:01] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428684 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [16:43:44] (03PS1) 10Cmjohnson: Updating dhcpd file w/db1016-17 [puppet] - 10https://gerrit.wikimedia.org/r/428685 (https://phabricator.wikimedia.org/T191792) [16:44:16] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428684 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [16:44:23] (03PS2) 10Cmjohnson: Updating dhcpd file w/db1016-17 [puppet] - 10https://gerrit.wikimedia.org/r/428685 (https://phabricator.wikimedia.org/T191792) [16:45:06] (03PS2) 10Ottomata: Use profile::kafka::broker for main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/428656 (https://phabricator.wikimedia.org/T192831) [16:45:29] (03CR) 10jerkins-bot: [V: 04-1] Use profile::kafka::broker for main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/428656 (https://phabricator.wikimedia.org/T192831) (owner: 10Ottomata) [16:45:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1113:3316 for alter table (duration: 00m 58s) [16:45:32] (03CR) 10Cmjohnson: [C: 032] Updating dhcpd file w/db1016-17 [puppet] - 10https://gerrit.wikimedia.org/r/428685 (https://phabricator.wikimedia.org/T191792) (owner: 10Cmjohnson) [16:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:42] RECOVERY - Check systemd state on mw2235 is OK: OK - running: The 
system is fully operational [16:45:42] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [16:45:46] !log Deploy schema change on db1113:3316 - T191519 T188299 T190148 [16:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:54] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [16:45:54] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [16:45:54] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [16:46:02] RECOVERY - puppet last run on mw2236 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:46:32] RECOVERY - Check the NTP synchronisation status of timesyncd on mw2235 is OK: OK: synced at Tue 2018-04-24 16:46:24 UTC. [16:47:01] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4154547 (10Cmjohnson) [16:47:06] (03PS3) 10Ottomata: Use profile::kafka::broker for main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/428656 (https://phabricator.wikimedia.org/T192831) [16:47:42] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [16:49:32] RECOVERY - Check the NTP synchronisation status of timesyncd on mw2236 is OK: OK: synced at Tue 2018-04-24 16:49:24 UTC. 
[16:49:55] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428684 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [16:50:32] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [16:52:10] !log mobrovac@tin Finished deploy [restbase/deploy@fbce520]: Set the delete probability to 100% - T192689 (duration: 11m 40s) [16:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:16] T192689: Unchecked storage growth(?) - https://phabricator.wikimedia.org/T192689 [16:52:32] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [16:52:55] (03PS1) 10Cmjohnson: Fixing dhcpd file for db1116/1117 [puppet] - 10https://gerrit.wikimedia.org/r/428688 (https://phabricator.wikimedia.org/T191792) [16:53:07] (03PS2) 10Cmjohnson: Fixing dhcpd file for db1116/1117 [puppet] - 10https://gerrit.wikimedia.org/r/428688 (https://phabricator.wikimedia.org/T191792) [16:53:50] 10Operations, 10ops-eqiad, 10Cassandra, 10hardware-requests, and 3 others: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4154572 (10fgiunchedi) a:05Cmjohnson>03Eevans I've gone ahead and reimaged restbase1010, all cassandra instances are masked... 
[16:54:00] (03CR) 10Cmjohnson: [C: 032] Fixing dhcpd file for db1116/1117 [puppet] - 10https://gerrit.wikimedia.org/r/428688 (https://phabricator.wikimedia.org/T191792) (owner: 10Cmjohnson) [16:54:50] (03PS4) 10Ottomata: Use profile::kafka::broker for main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/428656 (https://phabricator.wikimedia.org/T192831) [16:54:52] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: (C)1.5e+04 ge (W)1e+04 ge 4874 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:54:54] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to add Matthias Geisler to the ldap/wmde group - https://phabricator.wikimedia.org/T191523#4154576 (10RStallman-legalteam) Matthias's NDA is fully signed and on file with legal. Thanks! [16:55:16] (03CR) 10jerkins-bot: [V: 04-1] Use profile::kafka::broker for main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/428656 (https://phabricator.wikimedia.org/T192831) (owner: 10Ottomata) [16:55:50] !log mobrovac@tin Started deploy [restbase/deploy@fbce520]: Set the delete probability to 100% - T192689 [16:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:16] (03CR) 10Ottomata: "Mostly no op, except for graphite/jmxtrans vs prometheus stuff https://puppet-compiler.wmflabs.org/compiler02/11021/kafka2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/428656 (https://phabricator.wikimedia.org/T192831) (owner: 10Ottomata) [16:57:08] !log starting Cassandra bootstrap, restbase1010-a -- T189822 [16:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:14] T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822 [16:57:15] (03PS5) 10Ottomata: Use profile::kafka::broker for main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/428656 (https://phabricator.wikimedia.org/T192831) [16:59:32] PROBLEM - restbase endpoints health on 
restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [16:59:36] godog: uh oh [16:59:42] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [16:59:43] openjdk version [16:59:52] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [17:00:02] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180424T1700). [17:00:13] wah wah [17:00:20] no parsoid deploy today [17:00:32] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [17:00:42] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [17:01:02] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [17:01:17] !log mobrovac@tin Finished deploy [restbase/deploy@fbce520]: Set the delete probability to 100% - T192689 (duration: 05m 27s) [17:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:24] T192689: Unchecked storage growth(?) 
- https://phabricator.wikimedia.org/T192689 [17:01:52] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [17:02:07] !log mobrovac@tin Started deploy [restbase/deploy@fbce520]: Set the delete probability to 100% [deploy to restbase1010] - T192689 [17:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:07] (03CR) 10Krinkle: coal: Point systemd and uwsgi config to scap-deployed version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/428659 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [17:04:11] !log mobrovac@tin Finished deploy [restbase/deploy@fbce520]: Set the delete probability to 100% [deploy to restbase1010] - T192689 (duration: 02m 04s) [17:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:42] (03PS1) 10Arturo Borrero Gonzalez: labs_bootstrapvz: add labtest-jessie specific firstboot.sh script [puppet] - 10https://gerrit.wikimedia.org/r/428694 (https://phabricator.wikimedia.org/T181523) [17:08:31] (03PS1) 10Elukey: profile::hadoop::monitoring::journalnode: correct port [puppet] - 10https://gerrit.wikimedia.org/r/428695 (https://phabricator.wikimedia.org/T192905) [17:09:08] (03CR) 10Elukey: [C: 032] profile::hadoop::monitoring::journalnode: correct port [puppet] - 10https://gerrit.wikimedia.org/r/428695 (https://phabricator.wikimedia.org/T192905) (owner: 10Elukey) [17:17:21] (03CR) 10Imarlier: coal: Point systemd and uwsgi config to scap-deployed version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/428659 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [17:19:04] 10Operations, 10Patch-For-Review, 10User-Elukey: Apache reload fails on stretch-based app servers - https://phabricator.wikimedia.org/T185195#4154624 (10MoritzMuehlenhoff) Some initial tests suggest that amending the TMPREAPER_PROTECT_EXTRA setting shipped in tmpreaper might fix this, but I need to run some... 
[17:22:18] (03PS1) 10Jforrester: Explcitly set driver for Tidy wikis to RaggettInternal* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428697 (https://phabricator.wikimedia.org/T192855) [17:23:06] (03CR) 10jerkins-bot: [V: 04-1] Explcitly set driver for Tidy wikis to RaggettInternal* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428697 (https://phabricator.wikimedia.org/T192855) (owner: 10Jforrester) [17:24:12] (03PS2) 10Jforrester: Explcitly set driver for Tidy wikis to RaggettInternal* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428697 (https://phabricator.wikimedia.org/T192855) [17:26:09] 10Operations, 10Prometheus-metrics-monitoring, 10User-fgiunchedi: Upgrade mysqld_exporter to 0.10.0 - https://phabricator.wikimedia.org/T161296#4154651 (10fgiunchedi) >>! In T161296#4149818, @jcrespo wrote: > This could be done massively right now. Missing hosts with 0.9.0 still (that are not set as spares,... [17:27:10] (03CR) 10Herron: coal: Point systemd and uwsgi config to scap-deployed version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/428659 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [17:27:44] (03CR) 10Subramanya Sastry: [C: 031] "LGTM. Needs testing." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/428697 (https://phabricator.wikimedia.org/T192855) (owner: 10Jforrester) [17:29:39] !log added MCR tables to labsdb1009 (slots, slot_roles, content_models, content) [17:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:13] (03PS1) 10Chad: Group0 to 1.32.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428702 [17:33:34] (03PS1) 10Ottomata: Use different GC opts for Kafka in Java 7 vs Java 8 [puppet] - 10https://gerrit.wikimedia.org/r/428703 (https://phabricator.wikimedia.org/T192831) [17:35:08] !log removing firewall block on cr1-eqdfw - T175361 [17:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:14] T175361: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361 [17:35:33] (03PS2) 10Ottomata: Use different GC opts for Kafka in Java 7 vs Java 8 [puppet] - 10https://gerrit.wikimedia.org/r/428703 (https://phabricator.wikimedia.org/T192831) [17:35:59] !log removing firewall block on cr1/2-codfw - T175361 [17:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:51] (03CR) 10Ottomata: [C: 032] "no op https://puppet-compiler.wmflabs.org/compiler02/11024/" [puppet] - 10https://gerrit.wikimedia.org/r/428703 (https://phabricator.wikimedia.org/T192831) (owner: 10Ottomata) [17:36:56] (03CR) 10Ottomata: [C: 032] Use different GC opts for Kafka in Java 7 vs Java 8 [puppet] - 10https://gerrit.wikimedia.org/r/428703 (https://phabricator.wikimedia.org/T192831) (owner: 10Ottomata) [17:37:44] (03PS6) 10Ottomata: Use profile::kafka::broker for main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/428656 (https://phabricator.wikimedia.org/T192831) [17:39:57] 10Operations, 10Mail, 10Patch-For-Review: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#4154718 (10herron) mx2001 has been repooled (thanks @ayounsi!) 
Will monitor closely for the rest of the day [17:39:58] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.29 [keeping static files] (duration: 06m 28s) [17:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:59] (03CR) 10Ottomata: [C: 032] "Looks good: https://puppet-compiler.wmflabs.org/compiler02/11025/kafka2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/428656 (https://phabricator.wikimedia.org/T192831) (owner: 10Ottomata) [17:41:34] (03CR) 10Krinkle: [C: 04-1] "This is missing the other array keys and would fail at least for 'tidyConfigFile' being unset. See https://gerrit.wikimedia.org/g/mediawi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428697 (https://phabricator.wikimedia.org/T192855) (owner: 10Jforrester) [17:42:16] !log temp disabling puppet on kafka200* to apply profile::kafka::broker in main-codfw T192831 [17:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:24] T192831: Use profile and prometheus for role::kafka::main::broker - https://phabricator.wikimedia.org/T192831 [17:44:31] (03PS4) 10Andrew Bogott: Rename 'm5' section to 'wikitech' and add explicit hostnames. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427930 (https://phabricator.wikimedia.org/T189542) [17:46:18] (03CR) 10Andrew Bogott: [C: 032] Rename 'm5' section to 'wikitech' and add explicit hostnames. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427930 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott) [17:47:34] (03Merged) 10jenkins-bot: Rename 'm5' section to 'wikitech' and add explicit hostnames. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427930 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott) [17:49:08] (03CR) 10jenkins-bot: Rename 'm5' section to 'wikitech' and add explicit hostnames. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/427930 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott) [17:49:53] !log andrew@tin Synchronized wmf-config/db-codfw.php: Renaming 'm5' section to 'wikitech' for T189542, one of two (duration: 00m 59s) [17:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:59] T189542: Update updatequerypages::cronjob and refreshlinks::cronjob now that silver no longer has a database - https://phabricator.wikimedia.org/T189542 [17:51:08] !log andrew@tin Synchronized wmf-config/db-eqiad.php: Renaming 'm5' section to 'wikitech' for T189542, two of two (duration: 00m 59s) [17:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:33] (03Abandoned) 10Andrew Bogott: Add 'wikitech' section for wikitech db hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427915 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott) [17:51:43] RECOVERY - mediawiki-installation DSH group on mw2235 is OK: OK [17:52:17] !log restarting wdqs-updater on all nodes for prometheus jmx exporter update - T192768 [17:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:21] T192768: wdqs-updater crashing not cleanly - https://phabricator.wikimedia.org/T192768 [17:53:37] !log demon@tin Started scap: bootstrap 1.32.0-wmf.1 [17:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:26] (03PS3) 10Krinkle: Fix wgTidyConfig expansion to not ignore `null` as value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428697 (https://phabricator.wikimedia.org/T192855) (owner: 10Jforrester) [17:54:29] anomie: ^ [17:54:46] (Untested) [17:55:48] (03PS6) 10Mobrovac: Enable EventBus for job events for all but wikitech wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425601 (https://phabricator.wikimedia.org/T191464) (owner: 10Ppchelko) [17:57:18] (03PS12) 10Volans: First working 
version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) [17:57:20] (03PS11) 10Volans: Add CLI script to be installed in the target hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504) [17:57:22] (03PS13) 10Volans: Add basic test coverage [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) [17:57:24] (03PS9) 10Volans: Add login and LDAP support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) [17:57:26] (03PS4) 10Volans: Add server side validation of client certificates [software/debmonitor] - 10https://gerrit.wikimedia.org/r/428302 (https://phabricator.wikimedia.org/T167504) [17:57:31] (03CR) 10Mobrovac: [C: 032] Enable EventBus for job events for all but wikitech wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425601 (https://phabricator.wikimedia.org/T191464) (owner: 10Ppchelko) [17:57:43] (03CR) 10jerkins-bot: [V: 04-1] First working version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [17:57:46] (03CR) 10jerkins-bot: [V: 04-1] Add basic test coverage [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [17:57:48] (03CR) 10jerkins-bot: [V: 04-1] Add CLI script to be installed in the target hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [17:57:50] (03CR) 10jerkins-bot: [V: 04-1] Add server side validation of client certificates [software/debmonitor] - 10https://gerrit.wikimedia.org/r/428302 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [17:57:52] (03CR) 10jerkins-bot: [V: 04-1] Add login and LDAP support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 
(https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [17:58:00] (03CR) 10Anomie: [C: 031] "Looks sane. Tested the ->get manually in mwrepl and it returns the correct value, and extract() does extract nulls." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428697 (https://phabricator.wikimedia.org/T192855) (owner: 10Jforrester) [17:58:12] Krinkle: ^ [17:58:23] * mobrovac taking over tin for 2 mins [17:58:44] (03Merged) 10jenkins-bot: Enable EventBus for job events for all but wikitech wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425601 (https://phabricator.wikimedia.org/T191464) (owner: 10Ppchelko) [17:59:45] (03PS4) 10Imarlier: coal: Point systemd and uwsgi config to scap-deployed version [puppet] - 10https://gerrit.wikimedia.org/r/428659 (https://phabricator.wikimedia.org/T191994) [17:59:52] (03CR) 10Imarlier: coal: Point systemd and uwsgi config to scap-deployed version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/428659 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180424T1800) [18:00:39] no_justification: you on scap on tin? [18:00:45] Yep [18:01:09] ah now i see your !log [18:01:23] (03CR) 10jenkins-bot: Enable EventBus for job events for all but wikitech wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425601 (https://phabricator.wikimedia.org/T191464) (owner: 10Ppchelko) [18:01:25] hm so we are in a conundrum as i've merged a patch in mw-config [18:02:15] ema: there's been a number of pod restarts indeed. looks like some very math formulae recalculations [18:02:20] PROBLEM - DPKG on mw2241 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:02:21] PROBLEM - dhclient process on mw2241 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[18:03:06] (03PS2) 10Gergő Tisza: Add jseditor to privileged groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421122 (https://phabricator.wikimedia.org/T190015) [18:03:17] (03PS2) 10Gergő Tisza: Temporarily preserve sysops' JS editing ability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421123 (https://phabricator.wikimedia.org/T190015) [18:04:00] PROBLEM - mediawiki-installation DSH group on mw2241 is CRITICAL: Host mw2241 is not in mediawiki-installation dsh group [18:04:00] PROBLEM - Disk space on mw2241 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:04:25] akosiaris: yeah, there were some 500s today seen from RB [18:04:40] no_justification: k, let me know when you are done so that we resolve this [18:04:51] It's a full scap [18:05:41] PROBLEM - nutcracker port on mw2241 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:05:41] PROBLEM - HHVM processes on mw2241 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:05:41] RECOVERY - mediawiki-installation DSH group on mw2236 is OK: OK [18:07:31] PROBLEM - HHVM rendering on mw2241 is CRITICAL: connect to address 10.192.0.67 and port 80: Connection refused [18:07:31] PROBLEM - nutcracker process on mw2241 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:09:10] PROBLEM - puppet last run on mw2241 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:10:09] mw2241 - still me , doing reinstalls. 
silencing it [18:10:34] 10Operations, 10Puppet, 10Analytics, 10Maps-Sprint, and 3 others: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948#4154846 (10Gehel) [18:10:47] most of them don't do this, it's a question of being more or less lucky [18:11:05] (03PS3) 10Gergő Tisza: Temporarily preserve sysops' JS editing ability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421123 (https://phabricator.wikimedia.org/T190015) [18:11:30] PROBLEM - BGP status on cr1-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active, AS6939/IPv4: Connect [18:11:32] (03CR) 10Gergő Tisza: "Thanks, fixed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421123 (https://phabricator.wikimedia.org/T190015) (owner: 10Gergő Tisza) [18:12:31] PROBLEM - Apache HTTP on mw2241 is CRITICAL: connect to address 10.192.0.67 and port 80: Connection refused [18:12:31] PROBLEM - MD RAID on mw2241 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:13:21] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 21 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/11645085/#!map [18:13:29] (03PS3) 10Gergő Tisza: Remove edituserjs from existing groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421124 (https://phabricator.wikimedia.org/T190015) [18:13:31] RECOVERY - Apache HTTP on mw2241 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.073 second response time [18:14:10] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 20 probes of 300 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [18:14:20] PROBLEM - Check size of conntrack table on mw2241 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[18:14:51] (03PS1) 10Cmjohnson: Updating dhcpd file for new db's [puppet] - 10https://gerrit.wikimedia.org/r/428705 (https://phabricator.wikimedia.org/T191792) [18:15:13] (03CR) 10jerkins-bot: [V: 04-1] Updating dhcpd file for new db's [puppet] - 10https://gerrit.wikimedia.org/r/428705 (https://phabricator.wikimedia.org/T191792) (owner: 10Cmjohnson) [18:16:01] PROBLEM - Nginx local proxy to apache on mw2241 is CRITICAL: connect to address 10.192.0.67 and port 443: Connection refused [18:16:01] PROBLEM - Check systemd state on mw2241 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:16:46] (03PS4) 10Gergő Tisza: Remove sitewide and user JS editing from old groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421124 (https://phabricator.wikimedia.org/T190015) [18:16:54] (03PS2) 10Cmjohnson: Updating dhcpd file for new db's [puppet] - 10https://gerrit.wikimedia.org/r/428705 (https://phabricator.wikimedia.org/T191792) [18:16:59] (03PS4) 10Gergő Tisza: Enforce that jseditor is the only group that can edit sitewide JS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421125 (https://phabricator.wikimedia.org/T190015) [18:17:41] (03PS5) 10Gergő Tisza: Enforce that jseditor is the only group that can edit non-own JS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421125 (https://phabricator.wikimedia.org/T190015) [18:18:10] (03CR) 10Cmjohnson: [C: 032] Updating dhcpd file for new db's [puppet] - 10https://gerrit.wikimedia.org/r/428705 (https://phabricator.wikimedia.org/T191792) (owner: 10Cmjohnson) [18:18:30] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 0 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/11645085/#!map [18:18:40] RECOVERY - BGP status on cr1-eqsin is OK: BGP OK - up: 117, down: 6, shutdown: 0 [18:19:07] (03PS1) 10Imarlier: scap::target: List allowed service commands, instead of wildcard [puppet] - 10https://gerrit.wikimedia.org/r/428707 [18:19:10] RECOVERY - IPv6 ping to eqsin 
on ripe-atlas-eqsin IPv6 is OK: OK - failed 5 probes of 300 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [18:24:18] (03PS1) 10Andrew Bogott: WMCS Vms: explicitly set openstack version to 'Mitaka'. [puppet] - 10https://gerrit.wikimedia.org/r/428708 [18:25:30] (03CR) 10Andrew Bogott: [C: 032] WMCS Vms: explicitly set openstack version to 'Mitaka'. [puppet] - 10https://gerrit.wikimedia.org/r/428708 (owner: 10Andrew Bogott) [18:26:13] (03PS1) 10Ottomata: Add instance_selector param to jmx_exporter_config and use for Kafka [puppet] - 10https://gerrit.wikimedia.org/r/428710 [18:26:33] (03PS2) 10Ottomata: Add instance_selector param to jmx_exporter_config and use for Kafka [puppet] - 10https://gerrit.wikimedia.org/r/428710 [18:26:58] (03CR) 10jerkins-bot: [V: 04-1] Add instance_selector param to jmx_exporter_config and use for Kafka [puppet] - 10https://gerrit.wikimedia.org/r/428710 (owner: 10Ottomata) [18:27:00] RECOVERY - HHVM processes on mw2241 is OK: PROCS OK: 6 processes with command name hhvm [18:27:20] RECOVERY - Disk space on mw2241 is OK: DISK OK [18:27:21] RECOVERY - Check size of conntrack table on mw2241 is OK: OK: nf_conntrack is 0 % full [18:27:40] RECOVERY - dhclient process on mw2241 is OK: PROCS OK: 0 processes with command name dhclient [18:27:40] RECOVERY - DPKG on mw2241 is OK: All packages OK [18:27:54] (03PS3) 10Ottomata: Add instance_selector param to jmx_exporter_config and use for Kafka [puppet] - 10https://gerrit.wikimedia.org/r/428710 [18:28:02] RECOVERY - HHVM rendering on mw2241 is OK: HTTP OK: HTTP/1.1 200 OK - 74364 bytes in 7.153 second response time [18:28:02] RECOVERY - MD RAID on mw2241 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [18:29:32] RECOVERY - Nginx local proxy to apache on mw2241 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 9.771 second response time [18:30:48] (03PS4) 10Ottomata: Add instance_selector param to jmx_exporter_config and use for Kafka [puppet] - 
10https://gerrit.wikimedia.org/r/428710 [18:34:52] (03CR) 10Ottomata: [C: 032] "Godog: Feel free to object, we can change, but I need to merge this to fix instances in codfw." [puppet] - 10https://gerrit.wikimedia.org/r/428710 (owner: 10Ottomata) [18:34:52] RECOVERY - nutcracker process on mw2241 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [18:35:02] RECOVERY - nutcracker port on mw2241 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [18:35:18] 10Operations, 10Ops-Access-Requests: Access to Google Search Console for Go Fish Digital - https://phabricator.wikimedia.org/T192893#4154930 (10herron) p:05Triage>03Normal [18:35:23] RECOVERY - Check systemd state on mw2241 is OK: OK - running: The system is fully operational [18:37:08] (03CR) 10Jforrester: [C: 031] Fix wgTidyConfig expansion to not ignore `null` as value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428697 (https://phabricator.wikimedia.org/T192855) (owner: 10Jforrester) [18:39:12] RECOVERY - puppet last run on mw2241 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:50:59] (03PS1) 10Ottomata: Use profile::kafka::broker in main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/428711 (https://phabricator.wikimedia.org/T192831) [18:53:13] (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11029/kafka1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/428711 (https://phabricator.wikimedia.org/T192831) (owner: 10Ottomata) [18:53:21] 10Operations, 10Ops-Access-Requests, 10User-Urbanecm: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830#4154966 (10herron) Looping in @RStallman-legalteam for the NDA portion [18:54:11] !log temp disabling puppet and applying profile::kafka::broker on kafka100* T192831 [18:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:18] T192831: Use profile and 
prometheus for role::kafka::main::broker - https://phabricator.wikimedia.org/T192831 [18:57:07] 10Operations, 10Code-Stewardship-Reviews, 10Services (watching): zotero translation server: code stewardship request - https://phabricator.wikimedia.org/T187194#4154995 (10danstillman) > there are currently two deadlines here: > > Firefox 52 ESR (which this is indirectly based on) EOLs in August 2018 This w... [18:57:20] 10Operations, 10Ops-Access-Requests, 10User-Urbanecm: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830#4150980 (10herron) p:05Triage>03Normal [19:00:05] no_justification: #bothumor I � Unicode. All rise for MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180424T1900). [19:00:51] * no_justification twiddles thumbs [19:08:51] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4155054 (10Cmjohnson) @Marostegui @jcrespo All of the db's with the exception of db1120 (c6) are installed and ready for you to take over. Most likely I have a bad cable or a lo... [19:09:09] (03PS1) 10Ottomata: Remove unused role::kafka::main::broker [puppet] - 10https://gerrit.wikimedia.org/r/428717 (https://phabricator.wikimedia.org/T192831) [19:11:57] PROBLEM - configured eth on db1118 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:11:57] PROBLEM - dhclient process on db1121 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:13:46] PROBLEM - dhclient process on db1118 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:13:46] PROBLEM - configured eth on db1117 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:13:46] PROBLEM - puppet last run on db1121 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[19:14:21] (03PS1) 10Cmjohnson: Removing mgmt entries stat1003 [dns] - 10https://gerrit.wikimedia.org/r/428719 (https://phabricator.wikimedia.org/T175150) [19:15:26] PROBLEM - dhclient process on db1117 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:15:26] PROBLEM - puppet last run on db1118 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:15:57] (03CR) 10Cmjohnson: [C: 032] Removing mgmt entries stat1003 [dns] - 10https://gerrit.wikimedia.org/r/428719 (https://phabricator.wikimedia.org/T175150) (owner: 10Cmjohnson) [19:17:07] PROBLEM - puppet last run on db1117 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:17:07] PROBLEM - Check systemd state on db1121 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:18:47] PROBLEM - Check the NTP synchronisation status of timesyncd on db1121 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:18:47] PROBLEM - Check systemd state on db1118 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:20:36] PROBLEM - Check systemd state on db1117 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:20:36] PROBLEM - Check the NTP synchronisation status of timesyncd on db1118 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:20:36] PROBLEM - DPKG on db1121 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:20:36] (03PS1) 10Thcipriani: Scap: MediaWiki Canary: setup swagger checks [puppet] - 10https://gerrit.wikimedia.org/r/428721 (https://phabricator.wikimedia.org/T136839) [19:22:17] PROBLEM - Check the NTP synchronisation status of timesyncd on db1117 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:22:17] PROBLEM - DPKG on db1118 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:22:17] PROBLEM - Disk space on db1121 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[19:22:56] (03PS1) 10Smalyshev: Set SPARQL services to use internal cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428722 (https://phabricator.wikimedia.org/T192942) [19:23:57] PROBLEM - DPKG on db1117 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:23:57] PROBLEM - Disk space on db1118 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:24:05] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#4155087 (10Cmjohnson) [19:24:07] (03CR) 10jerkins-bot: [V: 04-1] Set SPARQL services to use internal cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428722 (https://phabricator.wikimedia.org/T192942) (owner: 10Smalyshev) [19:24:29] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3584101 (10Cmjohnson) 05Open>03Resolved [19:25:16] (03PS2) 10Smalyshev: Set SPARQL services to use internal cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428722 (https://phabricator.wikimedia.org/T192942) [19:25:46] PROBLEM - Disk space on db1117 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[19:27:57] RECOVERY - DPKG on db1117 is OK: All packages OK [19:28:28] RECOVERY - dhclient process on db1117 is OK: PROCS OK: 0 processes with command name dhclient [19:28:37] RECOVERY - Check systemd state on db1117 is OK: OK - running: The system is fully operational [19:28:47] RECOVERY - Disk space on db1117 is OK: DISK OK [19:28:47] RECOVERY - configured eth on db1117 is OK: OK - interfaces up [19:29:27] RECOVERY - DPKG on db1118 is OK: All packages OK [19:29:38] RECOVERY - DPKG on db1121 is OK: All packages OK [19:29:48] RECOVERY - dhclient process on db1118 is OK: PROCS OK: 0 processes with command name dhclient [19:29:58] RECOVERY - Check systemd state on db1118 is OK: OK - running: The system is fully operational [19:30:07] RECOVERY - Disk space on db1118 is OK: DISK OK [19:30:08] RECOVERY - configured eth on db1118 is OK: OK - interfaces up [19:30:08] RECOVERY - dhclient process on db1121 is OK: PROCS OK: 0 processes with command name dhclient [19:30:27] RECOVERY - Disk space on db1121 is OK: DISK OK [19:30:27] RECOVERY - puppet last run on db1118 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:32:08] RECOVERY - puppet last run on db1117 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:32:17] PROBLEM - grafana-admin.wikimedia.org on krypton is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:32:18] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_brokers=kaf [19:32:19] PROBLEM - Varnish frontend child restarted on cp1099 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1099&var-datasource=eqiad+prometheus/ops [19:32:27] PROBLEM - grafana.wikimedia.org on krypton is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:32:37] PROBLEM - PyBal BGP sessions are established on lvs1003 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=eqiad%2520prometheus%252Fops [19:32:39] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1004 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_brokers=kafka-jumbo1004 [19:32:39] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_brokers=kaf [19:32:40] ^^ is me, yar having prometheus problems [19:32:47] PROBLEM - High lag on wdqs1005 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:32:47] PROBLEM - Varnish backend child restarted on cp1054 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1054&var-datasource=eqiad+prometheus/ops [19:32:47] PROBLEM - Varnish frontend child restarted on cp1064 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1064&var-datasource=eqiad+prometheus/ops [19:32:48] PROBLEM - Varnish backend child restarted on cp1064 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1064&var-datasource=eqiad+prometheus/ops [19:32:54] actually
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:32:59] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 53.02, 27.76, 20.74 [19:33:00] this is no [19:33:04] this is prometheus in general? [19:33:07] PROBLEM - PyBal BGP sessions are established on lvs1002 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=eqiad%2520prometheus%252Fops [19:33:08] PROBLEM - Varnish backend child restarted on cp1073 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1073&var-datasource=eqiad+prometheus/ops [19:33:08] PROBLEM - Zookeeper Alive Client Connections too high on conf1003 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=6&fullscreen [19:33:09] PROBLEM - Varnish backend child restarted on cp1068 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1068&var-datasource=eqiad+prometheus/ops [19:33:16] yes, looks like prometheus failing in eqiad [19:33:17] PROBLEM - Varnish frontend child restarted on cp1045 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1045&var-datasource=eqiad+prometheus/ops [19:33:18] PROBLEM - Varnish backend child restarted on cp1061 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1061&var-datasource=eqiad+prometheus/ops [19:33:18] PROBLEM - Varnish frontend child restarted on cp1052 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1052&var-datasource=eqiad+prometheus/ops [19:33:19] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1002 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_brokers=kafka-jumbo1002 [19:33:19] PROBLEM - Varnish frontend child restarted on cp1051 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1051&var-datasource=eqiad+prometheus/ops [19:33:19] PROBLEM - Varnish frontend child restarted on cp1068 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1068&var-datasource=eqiad+prometheus/ops [19:33:19] PROBLEM - Varnish frontend child restarted on cp1073 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1073&var-datasource=eqiad+prometheus/ops [19:33:20] PROBLEM - Varnish backend child restarted on cp1072 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1072&var-datasource=eqiad+prometheus/ops [19:33:20] PROBLEM - Varnish backend child restarted on cp1074 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1074&var-datasource=eqiad+prometheus/ops [19:33:21] PROBLEM - Varnish backend child restarted on cp1065 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1065&var-datasource=eqiad+prometheus/ops [19:33:21] PROBLEM - Varnish backend child restarted on cp1051 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1051&var-datasource=eqiad+prometheus/ops [19:33:22] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 49.34, 26.95, 20.92 [19:33:23] and thus criticalling everything that's prometheus-monitored [19:33:28] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_brokers=kaf [19:33:28] PROBLEM - Zookeeper Alive Client Connections too high on conf1002 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=6&fullscreen [19:33:28] PROBLEM - Varnish backend child restarted on cp1048 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1048&var-datasource=eqiad+prometheus/ops [19:33:39] PROBLEM - High lag on wdqs1004 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:33:39] PROBLEM - Varnish frontend child restarted on cp1074 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1074&var-datasource=eqiad+prometheus/ops [19:33:39] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_brokers=kaf [19:33:39] PROBLEM - Varnish frontend child restarted on cp1063 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1063&var-datasource=eqiad+prometheus/ops [19:33:40] PROBLEM - PyBal BGP sessions are established on lvs1006 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=eqiad%2520prometheus%252Fops [19:33:41] PROBLEM - Varnish frontend child restarted on cp1066 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1066&var-datasource=eqiad+prometheus/ops [19:33:41] yeah [19:33:41] PROBLEM - Varnish backend child restarted on cp1052 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1052&var-datasource=eqiad+prometheus/ops [19:33:41] PROBLEM - Varnish frontend child restarted on cp1054 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1054&var-datasource=eqiad+prometheus/ops [19:33:42] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1005 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_brokers=kafka-jumbo1005 [19:33:42] PROBLEM - Zookeeper Alive Client Connections too high on conf1001 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=6&fullscreen [19:33:42] but there are also some non-prometheus alerts [19:33:43] PROBLEM - Varnish backend child restarted on cp1099 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1099&var-datasource=eqiad+prometheus/ops [19:33:43] PROBLEM - Varnish backend child restarted on cp1058 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1058&var-datasource=eqiad+prometheus/ops [19:33:47] PROBLEM - Varnish frontend child restarted on cp1049 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1049&var-datasource=eqiad+prometheus/ops [19:33:47] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1006 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_brokers=kafka-jumbo1006 [19:33:50] PROBLEM - High lag on wdqs1008 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:33:50] PROBLEM - Varnish frontend child restarted on cp1067 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1067&var-datasource=eqiad+prometheus/ops [19:33:51] PROBLEM - High lag on wdqs1006 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:33:51] PROBLEM - Varnish backend child restarted on cp1055 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1055&var-datasource=eqiad+prometheus/ops [19:33:53] i have been mucking with prometheus..., but not really, only jmx exporter stuff [19:33:57] PROBLEM - statsd UDP receive errors are elevated on graphite1001 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&refresh=1m&panelId=16&fullscreen [19:33:57] RECOVERY - puppet last run on db1121 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:33:59] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1001 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_brokers=kafka-jumbo1001 [19:33:59] and redefining dashboard and alert queries for kafka [19:34:03] suggesting perhaps there is a real issue going on too, and perhaps the real issue could be spamming the prometheus server causing further fallout [19:34:05] (CR) Anomie: [C: 1] Temporarily preserve sysops' JS editing ability [mediawiki-config] - https://gerrit.wikimedia.org/r/421123 (https://phabricator.wikimedia.org/T190015) (owner: Gergő Tisza) [19:34:07] RECOVERY - High lag on wdqs1007 is OK: (C)3600 ge (W)1200 ge 20 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:34:09] RECOVERY - Varnish backend child restarted on cp1073 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1073&var-datasource=eqiad+prometheus/ops [19:34:09] PROBLEM - statsd UDP receive errors are elevated on graphite1003 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&refresh=1m&panelId=16&fullscreen [19:34:10] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_brokers=kaf [19:34:10] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_brokers=kaf [19:34:17] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1003 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_brokers=kafka-jumbo1003 [19:34:17] PROBLEM - High lag on wdqs1003 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:34:17] PROBLEM - Varnish backend child restarted on cp1049 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1049&var-datasource=eqiad+prometheus/ops [19:34:18] PROBLEM - Varnish backend child restarted on cp1066 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1066&var-datasource=eqiad+prometheus/ops [19:34:18] PROBLEM - HTTP availability for Varnish on einsteinium is CRITICAL: http://prometheus.svc.eqiad.wmnet/global/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:34:23] there's some alerts within the spam for high load on MW*, for wdqs lag, etc [19:34:27] PROBLEM - Varnish backend child restarted on cp1050 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1050&var-datasource=eqiad+prometheus/ops [19:34:28] PROBLEM - Varnish backend child restarted on cp1062 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1062&var-datasource=eqiad+prometheus/ops [19:34:29] which are not just prom-fail [19:34:38] RECOVERY - Varnish frontend child restarted on cp1099 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1099&var-datasource=eqiad+prometheus/ops [19:34:38] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is OK: (C)10 ge (W)5 ge 0 https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_brokers=kafka-jumbo1005 [19:34:39] RECOVERY - Zookeeper Alive Client Connections too high on conf1002 is OK: (C)1024 ge (W)512 ge 13 https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=6&fullscreen [19:34:39] RECOVERY - Varnish frontend child restarted on cp1063 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1063&var-datasource=eqiad+prometheus/ops [19:34:39] RECOVERY - Varnish backend child restarted on cp1048 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1048&var-datasource=eqiad+prometheus/ops [19:34:40] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is OK: (C)10 ge (W)5 ge 0 https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_brokers=kafka-jumbo1003 [19:34:40] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is OK: (C)10 ge (W)5 ge 0 
https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_brokers=kafka-jumbo1001 [19:34:40] RECOVERY - High lag on wdqs1004 is OK: (C)3600 ge (W)1200 ge 53 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:34:41] RECOVERY - Varnish frontend child restarted on cp1074 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1074&var-datasource=eqiad+prometheus/ops [19:34:42] wdqs lag is prometheus based [19:34:45] here I am [19:34:47] RECOVERY - grafana.wikimedia.org on krypton is OK: HTTP OK: HTTP/1.1 200 OK - 31389 bytes in 6.537 second response time [19:34:47] RECOVERY - Varnish frontend child restarted on cp1066 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1066&var-datasource=eqiad+prometheus/ops [19:34:47] RECOVERY - Varnish frontend child restarted on cp1054 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1054&var-datasource=eqiad+prometheus/ops [19:34:47] RECOVERY - Zookeeper Alive Client Connections too high on conf1001 is OK: (C)1024 ge (W)512 ge 15 https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=6&fullscreen [19:34:48] RECOVERY - Varnish backend child restarted on cp1052 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1052&var-datasource=eqiad+prometheus/ops [19:34:48] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1005 is OK: (C)5e+06 ge (W)1e+06 ge 0 
https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_brokers=kafka-jumbo1005 [19:34:49] RECOVERY - PyBal BGP sessions are established on lvs1006 is OK: (C)0 le (W)0 le 1 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=eqiad%2520prometheus%252Fops [19:34:57] RECOVERY - Varnish backend child restarted on cp1099 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1099&var-datasource=eqiad+prometheus/ops [19:34:57] RECOVERY - Varnish backend child restarted on cp1058 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1058&var-datasource=eqiad+prometheus/ops [19:35:08] gehel: yeah but it had a number? [19:35:09] PROBLEM - Varnish frontend child restarted on cp1061 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1061&var-datasource=eqiad+prometheus/ops [19:35:13] gehel: is something wrong with icinga lag reporting? [19:35:13] prom1003 is very loaded, bblack saying that might be just a symptom? [19:35:17] PROBLEM - PyBal BGP sessions are established on lvs1001 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=eqiad%2520prometheus%252Fops [19:35:17] PROBLEM - Varnish frontend child restarted on cp1071 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1071&var-datasource=eqiad+prometheus/ops [19:35:17] PROBLEM - Varnish frontend child restarted on cp1050 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1050&var-datasource=eqiad+prometheus/ops [19:35:17] PROBLEM - PyBal BGP sessions are established on lvs1004 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=eqiad%2520prometheus%252Fops [19:35:17] PROBLEM - Varnish frontend child restarted on cp1072 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1072&var-datasource=eqiad+prometheus/ops [19:35:18] RECOVERY - HTTP availability for Varnish on einsteinium is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:35:18] PROBLEM - Varnish frontend child restarted on cp1065 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1065&var-datasource=eqiad+prometheus/ops [19:35:26] shall we talk in mw-sec? hard to read here [19:35:28] PROBLEM - Varnish frontend child restarted on cp1062 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. 
(read timeout=10) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1062&var-datasource=eqiad+prometheus/ops [19:35:38] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 21.95, 25.10, 21.18 [19:36:07] !log stop ircecho on einsteinium - icinga shower [19:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:40] for the record, I did update prometheus-jmx-exporter on wdqs servers earlier today [19:37:40] so high CPU load on API appserver [19:37:48] it seems odd to be correlated with a prometheus issue [19:38:36] we have been seeing a hhvm issue that triggers a bit more load on apis, fyi [19:38:41] ok [19:38:55] https://phabricator.wikimedia.org/T192610 is the ticket from esams/bast3002 having the same prometheus issue [19:39:09] this seems probably similar, but maybe we hold off and wait and see if it self-resolves [19:39:12] https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=prometheus1003&var-datasource=eqiad%20prometheus%2Fops [19:39:20] prometheus1003's cpu usage skyrocketed [19:39:21] (my restart attempt on bast3002 took forever, and ended up losing some data) [19:40:23] hm [19:40:32] !log demon@tin Finished scap: bootstrap 1.32.0-wmf.1 (duration: 106m 55s) [19:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:04] same thing for prometheus1004, but it is going down [19:45:13] (PS1) BBlack: s/CRITICAL/UNKNOWN/ for icinga->prometheus fetch/decode errors [puppet] - https://gerrit.wikimedia.org/r/428725 (https://phabricator.wikimedia.org/T192610) [19:45:56] mobrovac: If you wanna do your restbase thing now, go ahead [19:46:07] I've gotta finish train, but I can take a break for now, the heavy lifting is done [19:46:30] Operations, Ops-Access-Requests, User-Urbanecm: Requesting access to production for SWAT deploy for Urbanecm - 
https://phabricator.wikimedia.org/T192830#4155192 (Urbanecm) [19:46:50] bblack: Fwiw, MW did just go out and rebuild l10n on all hosts. Only on testwiki tho, so no traffic [19:47:06] did the wmf.30 train only roll out to en.wp today? [19:47:12] right, I just wonder if there's possibly some interactions between live things and prometheus load [19:47:21] Operations, Ops-Access-Requests, User-Urbanecm: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830#4150980 (Urbanecm) ^^^ key changed to 4096 bit per https://wikitech.wikimedia.org/wiki/Production_shell_access ^^^ [19:47:29] MatmaRex: yesterday [19:47:33] (e.g. some logging/monitoring agent fires 10x the event rate towards prometheus servers when certain prod deployments happen or certain prod problems happen, etc) [19:47:44] greg-g: ah. okay, thanks [19:47:53] Operations, Ops-Access-Requests, User-Urbanecm: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830#4155197 (RStallman-legalteam) @ Urbanecm - for the NDA, I'll need an email address and a physical address. You can email me the details: rstallman@w... [19:47:53] http://tools.wmflabs.org/sal/log/AWL0PGpUBEfgIt1jfJZz [19:48:07] could grafana do it? i'm messing with templates in a dashboard, and it seems that my all value is somehow not filtering? it's possible it is issuing a query for all nodes [19:48:15] yes, grafana was a trigger last time [19:48:22] i think i did it then [19:48:31] inefficient grafana->prometheus queries can do this, even just readonly ones [19:48:33] not sure why my all value is selecting everything [19:48:55] the template selector only shows 6 options [19:49:04] (which seems like an architectural flaw to me - reading stats shouldn't crush the thing receiving them and triggering alerts on them) [19:49:12] indeed [19:49:44] anyways, let's bring back icinga-wm? 
[19:49:50] yep, doing it [19:49:55] !log re-enable ircecho [19:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:33] I'm gonna go ahead and merge my patch FWIW. It can't be any worse than present situation, and the europeans can debate it or revert it tomorrow :) [19:50:46] aye [19:50:54] ack [19:51:21] (03CR) 10BBlack: [C: 032] s/CRITICAL/UNKNOWN/ for icinga->prometheus fetch/decode errors [puppet] - 10https://gerrit.wikimedia.org/r/428725 (https://phabricator.wikimedia.org/T192610) (owner: 10BBlack) [19:52:17] RECOVERY - Check the NTP synchronisation status of timesyncd on db1117 is OK: OK: synced at Tue 2018-04-24 19:52:14 UTC. [19:53:28] !log prometheus-fail switched to UNKNOWNs for now in https://gerrit.wikimedia.org/r/#/c/428725/ - may want to look at this further later, intent is to reduce odds of debilitating ops spam for the evening. [19:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:22] ircecho up and running [19:54:33] einsteinium already puppeted with script change too [19:54:40] so should be good, assuming it works! [19:54:51] thanks! [19:55:05] * elukey off again :) [19:58:08] k no_justification, going now, should take only 2 mins [19:59:08] no_justification: hi, any idea how to fix T192877 ? 
[19:59:09] T192877: Fix gor.wikipedia to use Gorontalo language - https://phabricator.wikimedia.org/T192877 [19:59:27] full scap and rebuildmessages didn't work [20:03:09] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Use EventBus for all wikis but wikitech - T191464 (duration: 01m 26s) [20:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:15] T191464: Enable CP4JQ support for private wikis - https://phabricator.wikimedia.org/T191464 [20:06:53] no_justification: i'm {{done}} [20:07:38] 10Operations, 10Ops-Access-Requests, 10User-Urbanecm: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830#4155259 (10Urbanecm) >>! In T192830#4155197, @RStallman-legalteam wrote: > @ Urbanecm - for the NDA, I'll need an email address and a physical address... [20:09:44] Nope :) [20:11:29] 10Operations, 10Puppet, 10Analytics, 10Cassandra, and 4 others: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948#4155263 (10mobrovac) [20:23:45] !log Run namespaceDupes on gorwiki [20:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:17] !log starting Cassandra bootstrap, restbase1010-b -- T189822 [20:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:23] T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822 [20:33:23] (03PS1) 10Ottomata: Fix kafka alert dashboard link [puppet] - 10https://gerrit.wikimedia.org/r/428826 (https://phabricator.wikimedia.org/T192831) [20:33:27] (03PS2) 10Ottomata: Remove unused role::kafka::main::broker [puppet] - 10https://gerrit.wikimedia.org/r/428717 (https://phabricator.wikimedia.org/T192831) [20:33:30] (03CR) 10Ottomata: [V: 032 C: 032] Remove unused role::kafka::main::broker [puppet] - 10https://gerrit.wikimedia.org/r/428717 (https://phabricator.wikimedia.org/T192831)
(owner: 10Ottomata) [20:33:42] (03PS2) 10Ottomata: Fix kafka alert dashboard link [puppet] - 10https://gerrit.wikimedia.org/r/428826 (https://phabricator.wikimedia.org/T192831) [20:34:33] (03PS3) 10Ottomata: Fix kafka alert dashboard link [puppet] - 10https://gerrit.wikimedia.org/r/428826 (https://phabricator.wikimedia.org/T192831) [20:35:19] (03CR) 10Ottomata: [C: 032] Fix kafka alert dashboard link [puppet] - 10https://gerrit.wikimedia.org/r/428826 (https://phabricator.wikimedia.org/T192831) (owner: 10Ottomata) [20:41:36] PROBLEM - Disk space on restbase1010 is CRITICAL: Return code of 255 is out of bounds [20:41:36] PROBLEM - dhclient process on restbase1010 is CRITICAL: Return code of 255 is out of bounds [20:41:37] PROBLEM - cassandra-b CQL 10.64.0.115:9042 on restbase1010 is CRITICAL: connect to address 10.64.0.115 and port 9042: Connection refused [20:41:46] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: Return code of 255 is out of bounds [20:41:46] PROBLEM - DPKG on restbase1010 is CRITICAL: Return code of 255 is out of bounds [20:41:47] PROBLEM - cassandra-b service on restbase1010 is CRITICAL: Return code of 255 is out of bounds [20:41:47] PROBLEM - Check size of conntrack table on restbase1010 is CRITICAL: Return code of 255 is out of bounds [20:41:57] PROBLEM - Check whether ferm is active by checking the default input chain on restbase1010 is CRITICAL: Return code of 255 is out of bounds [20:41:57] PROBLEM - cassandra-a service on restbase1010 is CRITICAL: Return code of 255 is out of bounds [20:42:16] PROBLEM - cassandra-c SSL 10.64.0.116:7001 on restbase1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [20:42:17] PROBLEM - MD RAID on restbase1010 is CRITICAL: Return code of 255 is out of bounds [20:42:17] PROBLEM - cassandra-c service on restbase1010 is CRITICAL: Return code of 255 is out of bounds [20:42:26] PROBLEM - configured eth on restbase1010 is CRITICAL: Return code of 255 is out of 
bounds [20:42:26] PROBLEM - Check systemd state on restbase1010 is CRITICAL: Return code of 255 is out of bounds [20:43:57] PROBLEM - puppet last run on restbase1010 is CRITICAL: Return code of 255 is out of bounds [20:45:07] PROBLEM - HP RAID on restbase1010 is CRITICAL: Return code of 255 is out of bounds [20:47:03] (03CR) 10Chad: [C: 032] Group0 to 1.32.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428702 (owner: 10Chad) [20:47:06] PROBLEM - IPMI Sensor Status on restbase1010 is CRITICAL: Return code of 255 is out of bounds [20:47:36] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests: decom niobium/WMF3428 - https://phabricator.wikimedia.org/T191355#4155411 (10Cmjohnson) [20:48:20] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests: decom niobium/WMF3428 - https://phabricator.wikimedia.org/T191355#4102362 (10Cmjohnson) Most of this was already completed, verified not in puppet, production dns was removed, switch port was already disabled. Moving to the next steps of wiping an... [20:48:22] (03Merged) 10jenkins-bot: Group0 to 1.32.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428702 (owner: 10Chad) [20:49:47] i have a small question about our deployment: while we're upgrading MediaWiki from wmf.N to wmf.(N+1), is it possible that a request would be served by a machine that has only some of the new code synced to it? 
[20:50:07] PROBLEM - Check the NTP synchronisation status of timesyncd on restbase1010 is CRITICAL: Return code of 255 is out of bounds [20:53:39] (03PS1) 10Cmjohnson: decommission db1029 and db1030 [puppet] - 10https://gerrit.wikimedia.org/r/428829 (https://phabricator.wikimedia.org/T184054) [20:54:07] (03CR) 10jerkins-bot: [V: 04-1] decommission db1029 and db1030 [puppet] - 10https://gerrit.wikimedia.org/r/428829 (https://phabricator.wikimedia.org/T184054) (owner: 10Cmjohnson) [20:54:21] (03CR) 10jenkins-bot: Group0 to 1.32.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428702 (owner: 10Chad) [20:54:26] (03PS2) 10Cmjohnson: decommission db1029 and db1030 [puppet] - 10https://gerrit.wikimedia.org/r/428829 (https://phabricator.wikimedia.org/T184054) [20:55:06] !log demon@tin rebuilt and synchronized wikiversions files: group0 to 1.32.0-wmf.1 [20:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:22] (03CR) 10Cmjohnson: [C: 032] decommission db1029 and db1030 [puppet] - 10https://gerrit.wikimedia.org/r/428829 (https://phabricator.wikimedia.org/T184054) (owner: 10Cmjohnson) [20:59:29] PROBLEM - Long running screen/tmux on restbase1010 is CRITICAL: Return code of 255 is out of bounds [21:18:35] (03PS1) 10Imarlier: graphite: allow data requests from performance.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/428836 (https://phabricator.wikimedia.org/T191994) [21:19:33] (03PS1) 10Cmjohnson: Removing db1034 from site.pp (decom) [puppet] - 10https://gerrit.wikimedia.org/r/428837 (https://phabricator.wikimedia.org/T182556) [21:20:04] (03CR) 10jerkins-bot: [V: 04-1] Removing db1034 from site.pp (decom) [puppet] - 10https://gerrit.wikimedia.org/r/428837 (https://phabricator.wikimedia.org/T182556) (owner: 10Cmjohnson) [21:20:46] (03PS2) 10Cmjohnson: Removing db1034 from site.pp (decom) [puppet] - 10https://gerrit.wikimedia.org/r/428837 (https://phabricator.wikimedia.org/T182556) [21:22:03] (03CR) 10Cmjohnson: [C: 
032] Removing db1034 from site.pp (decom) [puppet] - 10https://gerrit.wikimedia.org/r/428837 (https://phabricator.wikimedia.org/T182556) (owner: 10Cmjohnson) [21:24:43] (03CR) 10Imarlier: "Puppet compiler run looks right to me: https://puppet-compiler.wmflabs.org/compiler02/11030/graphite1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/428836 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [21:29:04] (03PS1) 10Cmjohnson: Removing dns entries db1029|31|34 [dns] - 10https://gerrit.wikimedia.org/r/428839 (https://phabricator.wikimedia.org/T182556) [21:29:45] (03CR) 10Cmjohnson: [C: 032] Removing dns entries db1029|31|34 [dns] - 10https://gerrit.wikimedia.org/r/428839 (https://phabricator.wikimedia.org/T182556) (owner: 10Cmjohnson) [21:30:42] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1034 - https://phabricator.wikimedia.org/T182556#4155612 (10Cmjohnson) [21:32:10] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1029 and db1031 - https://phabricator.wikimedia.org/T184054#3870892 (10Cmjohnson) [21:37:11] (03CR) 10Krinkle: [C: 031] graphite: allow data requests from performance.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/428836 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [21:41:43] (03PS1) 10Chad: mobilelanding: Use actual location of MWMultiVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428842 [21:49:35] (03PS1) 10Chad: Drop $wgTitle usages from robots.txt and extract2.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428843 [21:50:51] (03PS1) 10Chad: Drop MEDIAWIKI_DBLIST_DIR, no longer used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428844 [21:51:33] (03PS1) 10Chad: Drop MEDIAWIKI_DIRECTORY_REGEX, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428845 [21:52:56] (03PS2) 10Chad: Drop MEDIAWIKI_DIRECTORY_REGEX & MEDIAWIKI_VERSION_REGEX unused 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/428845 [21:54:09] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to add Matthias Geisler to the ldap/wmde group - https://phabricator.wikimedia.org/T191523#4155690 (10Dzahn) a:05RStallman-legalteam>03Dzahn [21:54:10] no_justification: :) - how are you accessing mobilelanding.php? [21:54:52] I'm not! [21:54:57] I'm just seeing errors in the logs [21:55:06] Oh, interesting. [21:56:00] https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2018.04.24/hhvm?id=AWL5k9GuGDCXvhDUcrTc&_g=h@66534ad [21:56:30] !log adding LDAP user 'bitpogo' to group 'wmde' (T191523) [21:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:37] T191523: Request to add Matthias Geisler to the ldap/wmde group - https://phabricator.wikimedia.org/T191523 [21:57:04] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to add Matthias Geisler to the ldap/wmde group - https://phabricator.wikimedia.org/T191523#4155708 (10Dzahn) [21:57:37] no_justification: Interesting. The only hostname that exposes it unconditionally redirects to a docroot that doesn't expose it. [21:57:41] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to add Matthias Geisler to the ldap/wmde group - https://phabricator.wikimedia.org/T191523#4108485 (10Dzahn) 05Open>03Resolved @Matthias_Geisler_WMDE Done! you have been added to the group. [21:57:43] Maybe internal monitoring? [21:57:44] Anyhow [21:57:48] (03CR) 10Krinkle: [C: 031] mobilelanding: Use actual location of MWMultiVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428842 (owner: 10Chad) [21:58:07] Krinkle: In any case, I can't find /any/ pointers to it [21:58:09] From anywhere [21:58:13] Other than the one apache rewrite [21:58:18] Yeah [21:58:19] Which, as you note, doesn't work. 
[21:58:23] :) [21:58:58] (03CR) 10Krinkle: [C: 031] Drop MEDIAWIKI_DIRECTORY_REGEX & MEDIAWIKI_VERSION_REGEX unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428845 (owner: 10Chad) [21:59:05] (03CR) 10Krinkle: [C: 031] Drop MEDIAWIKI_DBLIST_DIR, no longer used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428844 (owner: 10Chad) [22:00:37] (03CR) 10Krinkle: [C: 031] Drop $wgTitle usages from robots.txt and extract2.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428843 (owner: 10Chad) [22:04:04] RECOVERY - mediawiki-installation DSH group on mw2241 is OK: OK [22:04:49] Considering all of the routing/redirection happens at the varnish layer, I'm inclined to kill that vhost outright [22:23:25] herron: re: access requests, this one is ready to go, if you feel like reviewing or merging, go ahead https://gerrit.wikimedia.org/r/#/c/427944/ [22:24:35] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1977 bytes in 0.099 second response time [22:24:54] never hurts to have a second pair of eyes check the UID and key [22:25:32] (03CR) 10Dzahn: [C: 031] admins: create shell account for mepps, add to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/427944 (https://phabricator.wikimedia.org/T192472) (owner: 10Dzahn) [22:32:24] PROBLEM - mediawiki-installation DSH group on mw2237 is CRITICAL: Host mw2237 is not in mediawiki-installation dsh group [22:36:35] PROBLEM - mediawiki-installation DSH group on mw2238 is CRITICAL: Host mw2238 is not in mediawiki-installation dsh group [22:36:46] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frbast1001 - https://phabricator.wikimedia.org/T187363#4155786 (10ayounsi) DHCP forwarder updated on pfw3-eqiad: ```lang=diff [edit forwarding-options helpers bootp] + server 10.64.40.70; - server 10.64.40.66; ``` [22:44:35] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is 
OK: HTTP OK: HTTP/1.1 200 OK - 1975 bytes in 0.100 second response time [22:47:50] jouncebot: now [22:47:50] No deployments scheduled for the next 0 hour(s) and 12 minute(s) [22:47:52] jouncebot: ext [22:47:55] jouncebot: next [22:47:55] In 0 hour(s) and 12 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180424T2300) [22:48:05] (03CR) 10Legoktm: [C: 032] Fix wgTidyConfig expansion to not ignore `null` as value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428697 (https://phabricator.wikimedia.org/T192855) (owner: 10Jforrester) [22:48:15] RECOVERY - cassandra-b CQL 10.64.0.115:9042 on restbase1010 is OK: TCP OK - 0.000 second response time on 10.64.0.115 port 9042 [22:49:23] (03Merged) 10jenkins-bot: Fix wgTidyConfig expansion to not ignore `null` as value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428697 (https://phabricator.wikimedia.org/T192855) (owner: 10Jforrester) [22:49:37] (03CR) 10jenkins-bot: Fix wgTidyConfig expansion to not ignore `null` as value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428697 (https://phabricator.wikimedia.org/T192855) (owner: 10Jforrester) [22:51:58] (03PS1) 10Catrope: Enable mapframe on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428851 (https://phabricator.wikimedia.org/T191584) [22:52:15] (03CR) 10Catrope: [C: 04-2] "Not before April 30th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428851 (https://phabricator.wikimedia.org/T191584) (owner: 10Catrope) [22:53:20] !log legoktm@tin Synchronized wmf-config/CommonSettings.php: Fix wgTidyConfig and restore proper tidy & Remex config - T192855 (duration: 01m 16s) [22:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:27] T192855: Remex enabled on all wikis in MW 1.30-wmf.30 exposing corruption (signatures coloring unrelated follow-up sections, etc.) 
on unfixed content - https://phabricator.wikimedia.org/T192855 [22:55:11] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: touch (duration: 01m 18s) [22:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Evening SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180424T2300). [23:00:05] mooeypoo and Amir1: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:58] o/ [23:01:07] (03CR) 10Chad: [C: 032] mobilelanding: Use actual location of MWMultiVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428842 (owner: 10Chad) [23:02:26] (03Merged) 10jenkins-bot: mobilelanding: Use actual location of MWMultiVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428842 (owner: 10Chad) [23:02:39] (03CR) 10jenkins-bot: mobilelanding: Use actual location of MWMultiVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428842 (owner: 10Chad) [23:04:32] !log demon@tin Synchronized docroot/m.wikipedia.org/w/mobilelanding.php: unbreak multiversion loading for a totally useless script (duration: 01m 16s) [23:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:43] Krinkle: :p ^ [23:06:01] anyone doing SWAT? [23:08:16] !log mw2242.codfw , mw2255.codfw et al.. 
more stretch reinstalls going on [23:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:44] (03PS1) 10Dbarratt: Disable Datetime Selector on Special:Block on all wikis except Meta, MediaWiki, and German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428854 (https://phabricator.wikimedia.org/T192962) [23:09:55] RECOVERY - mediawiki-installation DSH group on mw2238 is OK: OK [23:10:24] RECOVERY - mediawiki-installation DSH group on mw2237 is OK: OK [23:10:56] (03PS2) 10Dbarratt: Disable Datetime Selector on Special:Block on all wikis except Meta, MediaWiki, and German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428854 (https://phabricator.wikimedia.org/T192962) [23:12:55] (03CR) 10BryanDavis: "Seems good conceptually. I started to question the SQL generation and injection possibilities, but then thought that being able to do some" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/428550 (owner: 10Bstorm) [23:14:07] uhm... anyone? SWAT ? [23:14:27] (03CR) 10MaxSem: Disable Datetime Selector on Special:Block on all wikis except Meta, MediaWiki, and German Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428854 (https://phabricator.wikimedia.org/T192962) (owner: 10Dbarratt) [23:16:03] (03CR) 10Dbarratt: Disable Datetime Selector on Special:Block on all wikis except Meta, MediaWiki, and German Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428854 (https://phabricator.wikimedia.org/T192962) (owner: 10Dbarratt) [23:20:21] mooeypoo: if it's urgent maybe ping chad since I can see some activity from them but they aren't one of the listed swaters for that time frame [23:21:08] yeah, he's not listed as doing SWAT. There has to be *someone* that can do that [23:22:04] there doesn't need to be "If no SWAT team member is around and available to do the deploys the window will be skipped. Please reschedule your patches."
[23:22:23] mooeypoo: what do you need deployed? [23:23:16] https://gerrit.wikimedia.org/r/#/c/428730/ [23:23:16] legoktm: https://gerrit.wikimedia.org/r/#/c/428730/ [23:23:25] **thank you** [23:23:26] ok, I can sync that out [23:23:27] we definitely have too many people listed as SWAT-ters, it means often no one actually ends up doing it, because they assume someone else will :/ [23:24:09] Yeah :\ [23:24:29] MatmaRex: yeah, I'm unhappy about it as well. :( Aside from doing an official rotation (yet another thing to manage the scheduling of...) I'm not sure how else to do it. [23:25:30] greg-g: it's been working before :) [23:25:45] Accidentally, it seems like [23:27:32] like all good things, by the graces of good people [23:27:52] Aye [23:27:53] there have been times when no one shows up to do it, but it's by far the minority of times [23:28:06] Yeah I don't think I've ever had a time like this one [23:28:28] I've had times where people were "meh, fine, I'll do it" but I haven't had a time where I actually thought there would be no one [23:29:16] !log starting Cassandra bootstrap, restbase1010-c -- T189822 [23:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:22] T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822 [23:29:29] mooeypoo: live on mwdebug1002, please test [23:30:25] RECOVERY - DPKG on restbase1010 is OK: All packages OK [23:30:34] testing [23:30:35] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [23:30:45] RECOVERY - Check size of conntrack table on restbase1010 is OK: OK: nf_conntrack is 10 % full [23:30:45] RECOVERY - cassandra-b service on restbase1010 is OK: OK - cassandra-b is active [23:30:46] RECOVERY - Check whether ferm is active by checking the default input chain on restbase1010 is OK: OK ferm input default policy is set [23:30:46] RECOVERY - cassandra-a service on restbase1010 is OK: OK - cassandra-a is 
active [23:30:56] RECOVERY - MD RAID on restbase1010 is OK: OK: Active: 15, Working: 15, Failed: 0, Spare: 0 [23:31:05] RECOVERY - cassandra-c service on restbase1010 is OK: OK - cassandra-c is active [23:31:06] RECOVERY - configured eth on restbase1010 is OK: OK - interfaces up [23:31:06] RECOVERY - Check systemd state on restbase1010 is OK: OK - running: The system is fully operational [23:31:15] RECOVERY - cassandra-c SSL 10.64.0.116:7001 on restbase1010 is OK: SSL OK - Certificate restbase1010-c valid until 2018-08-17 16:11:07 +0000 (expires in 114 days) [23:31:16] RECOVERY - Disk space on restbase1010 is OK: DISK OK [23:31:16] RECOVERY - dhclient process on restbase1010 is OK: PROCS OK: 0 processes with command name dhclient [23:32:18] legoktm: working! \o/ [23:33:41] !log legoktm@tin Synchronized php-1.32.0-wmf.1/extensions/Kartographer/includes/Tag/MapFrame.php: MapFrame: Allow lang="local" to be passed (duration: 01m 17s) [23:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:53] mooeypoo: ^^ [23:34:05] RECOVERY - puppet last run on restbase1010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [23:34:40] \o/ thank you! 
[23:34:45] np [23:35:21] legoktm: And from Joe's excited remarks in the next table -- it's definitely in production now :D [23:35:25] RECOVERY - HP RAID on restbase1010 is OK: OK: Slot 0: no logical drives --- Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:1:5 - Controller: OK [23:43:34] 10Operations, 10Phabricator, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568#4155959 (10greg) [23:44:00] 10Operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash, 10Release-Engineering-Team (Watching / External): Logstash no longer captures DB queries in debug mode - https://phabricator.wikimedia.org/T190455#4155960 (10greg) [23:44:19] woot [23:47:06] RECOVERY - IPMI Sensor Status on restbase1010 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [23:50:15] RECOVERY - Check the NTP synchronisation status of timesyncd on restbase1010 is OK: OK: synced at Tue 2018-04-24 23:50:06 UTC. [23:52:40] (03PS1) 10Madhuvishy: dumps: Copy web server logs to stat host [puppet] - 10https://gerrit.wikimedia.org/r/428864 (https://phabricator.wikimedia.org/T168486) [23:53:52] (03PS2) 10Chad: Drop MEDIAWIKI_DBLIST_DIR, no longer used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428844 [23:53:54] (03PS3) 10Chad: Drop MEDIAWIKI_DIRECTORY_REGEX & MEDIAWIKI_VERSION_REGEX unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428845 [23:53:56] (03PS1) 10Chad: multiversion: Don't use getRealmSpecificFilename where it's not needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428865 [23:55:23] (03CR) 10Madhuvishy: [C: 032] dumps: Copy web server logs to stat host [puppet] - 10https://gerrit.wikimedia.org/r/428864 (https://phabricator.wikimedia.org/T168486) (owner: 10Madhuvishy)