[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170303T0000).
[00:00:04] Smalyshev and Krinkle: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:17] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:00:23] o/
[00:01:35] Can I do one quick deploy (the prior window ran long due to nested complications), then do the rest if there is time at the end of the SWAT?
[00:01:40] ^ SMalyshev, Krinkle
[00:01:52] I can run the SWAT too.
[00:02:52] \o
[00:03:42] am here
[00:03:56] hmm, i'm on the list, the bot just didn't mention me
[00:04:18] Hi.
[00:04:19] matt_flaschen: go ahead for your deployment + the SWAT
[00:04:20] mine is super trivial config change, reverting a small test back to what we've been using for months
[00:04:57] for me it's https://gerrit.wikimedia.org/r/#/c/340695/ - config change for wdqs metrics collection to collect more metrics
[00:09:17] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational
[00:09:27] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active
[00:14:14] so, anything happening?
[00:14:18] matt_flaschen: ping?
[00:16:05] Dereckson, I tested on mwdebug1002, and sync is running now.
[00:16:39] !log mattflaschen@tin Synchronized php-1.29.0-wmf.14/extensions/Flow/: Fix autoload data and script (duration: 00m 59s)
[00:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:22] ah, wait, mine is actually puppet, so probably should be on puppet swat...
[00:17:45] SMalyshev, yeah, I don't have +2 for that repo.
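The labstore1005 "Check systemd state" alert above flips between CRITICAL (degraded: at least one unit failed) and OK (running) based on systemd's overall system state, as reported by `systemctl is-system-running`. A minimal sketch of that mapping, assuming the usual Nagios exit-code convention (this is an illustration, not the actual NRPE plugin used in production):

```python
# Sketch of an Icinga/Nagios-style check on the overall systemd state.
# State names follow `systemctl is-system-running`; the OK/CRITICAL
# messages mirror the alert text above. Illustrative only.

NAGIOS_OK, NAGIOS_CRITICAL, NAGIOS_UNKNOWN = 0, 2, 3

def check_systemd_state(state):
    """Map a systemd system state string to (exit code, message)."""
    if state == "running":
        return NAGIOS_OK, "OK - running: The system is fully operational"
    if state == "degraded":
        return (NAGIOS_CRITICAL,
                "CRITICAL - degraded: The system is operational "
                "but one or more units failed.")
    # Anything else (starting, maintenance, ...) is reported as UNKNOWN.
    return NAGIOS_UNKNOWN, "UNKNOWN - unexpected state: " + state
```

The later RECOVERY at 00:09:17 corresponds to the state returning to `running` once the failed unit (maintain-dbusers) was active again.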
[00:17:47] ebernhardson, will you be able to test on mwdebug1002, or should I skip that?
[00:17:55] okie, moving
[00:18:10] 06Operations, 13Patch-For-Review, 06Services (watching), 15User-mobrovac: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486#3069642 (10RobH)
[00:18:19] (03CR) 10Mattflaschen: [C: 032] Revert "Test disable super_detect_noop script" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340902 (owner: 10EBernhardson)
[00:19:41] 06Operations, 13Patch-For-Review, 06Services (watching), 15User-mobrovac: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486#3069098 (10RobH) a:05RobH>03mobrovac This is ready for handoff to services. I'm assuming that @mobrovac would handle this (since he was tracking the #hw-...
[00:25:54] (03PS2) 10Mattflaschen: Revert "Test disable super_detect_noop script" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340902 (owner: 10EBernhardson)
[00:26:28] (03CR) 10Mattflaschen: [C: 032] Revert "Test disable super_detect_noop script" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340902 (owner: 10EBernhardson)
[00:26:45] :(, I didn't see there was a merge conflict. Rebased, +2'ed again.
[00:29:32] (03Merged) 10jenkins-bot: Revert "Test disable super_detect_noop script" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340902 (owner: 10EBernhardson)
[00:29:44] (03CR) 10jenkins-bot: Revert "Test disable super_detect_noop script" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340902 (owner: 10EBernhardson)
[00:30:57] ^ ebernhardson, see question about mwdebug1002
[00:31:22] matt_flaschen: ahh, no test on mwdebug
[00:31:28] it only affects indexing operations from job runners
[00:35:11] !log mattflaschen@tin Synchronized wmf-config/CirrusSearch-common.php: CirrusSearch: Enable super_detect_noop (duration: 00m 39s)
[00:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:12] ^ ebernhardson, done, please test if possible.
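The change deployed above re-enables CirrusSearch's `super_detect_noop` behaviour for index updates coming from the job runners: an update is skipped when it would not actually change the stored document. Roughly, as a simplified Python stand-in (the real logic is an Elasticsearch-side script with per-field update hints; the field names and tolerance handling here are illustrative):

```python
def is_noop(old_doc, new_doc, within=None):
    """Return True when applying new_doc over old_doc would change nothing.

    `within` optionally maps a field name to a numeric tolerance, so the
    document is not rewritten just because a derived value (e.g. a
    popularity score) drifted slightly. Simplified sketch, not the actual
    CirrusSearch script.
    """
    within = within or {}
    for field, new_val in new_doc.items():
        old_val = old_doc.get(field)
        tol = within.get(field)
        if tol is not None and isinstance(new_val, (int, float)) \
                and isinstance(old_val, (int, float)):
            if abs(new_val - old_val) > tol:
                return False  # changed beyond tolerance: real update
        elif new_val != old_val:
            return False  # field changed: real update
    return True  # nothing changed: skip the index write
```

Skipping no-op writes matters mainly for write amplification on the search cluster, which is why the log notes it can only be observed from job-runner indexing traffic rather than on mwdebug.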
[00:37:14] i'll keep an eye on logs, if it's a problem indexing will show it quickly (probably already)
[00:38:52] 06Operations, 10Revision-Scoring-As-A-Service-Backlog, 13Patch-For-Review: Set up oresrdb redis node in codfw - https://phabricator.wikimedia.org/T139372#3069705 (10Halfak) Hey! Just to confirm, cache_misses are recorded for precaching requests. We should probably exclude them. {T159502}
[00:41:50] (03CR) 10Krinkle: [C: 04-1] "Update createTxtFileSymlinks.sh and/or run that. Otherwise this'll automatically be removed the next time it runs." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340487 (owner: 10Jcrespo)
[00:42:28] Krinkle, you're up next, Jenkins is running.
[00:45:02] matt_flaschen: okay
[00:56:17] PROBLEM - puppet last run on ms-be1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:00:31] Krinkle, on mwdebug1002, please test.
[01:04:23] matt_flaschen: Testing
[01:04:49] matt_flaschen: I'm testing on tin instead, maintenance script, it's a no-op for testwiki, I'll run it for that.
[01:05:06] actually, terbium. scap-pull..
[01:05:15] Yeah, thanks. I was trying to find the error I got before to remember if it affected all maint scripts.
[01:05:27] https://phabricator.wikimedia.org/P4974
[01:05:55] works on terbium for testwiki
[01:06:28] Okay, proceeding.
[01:09:08] !log mattflaschen@tin Synchronized php-1.29.0-wmf.14/maintenance/purgeModuleDeps.php: resourceloader: Add purgeModuleDeps.php maintenance script (duration: 00m 40s)
[01:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:10:29] !log mattflaschen@tin Synchronized php-1.29.0-wmf.14/maintenance/cleanupRemovedModules.php: resourceloader: Add purgeModuleDeps.php maintenance script (duration: 00m 40s)
[01:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:22] !log mattflaschen@tin Synchronized php-1.29.0-wmf.14/autoload.php: resourceloader: Add purgeModuleDeps.php maintenance script (duration: 00m 39s)
[01:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:32] Krinkle, done, please test.
[01:11:56] matt_flaschen: Well, it's already on terbium now :) I'll run it on other wikis later today.
[01:12:01] It worked on testwiki
[01:12:48] !log Restarted tilerator on codfw tileservers to catch latest code changes
[01:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:15] Krinkle, good point, but you might want to test the old one too at some point.
[01:14:49] SWAT complete
[01:14:56] matt_flaschen: Yeah, ran both scripts on test and test2 just now
[01:15:02] Okay, cool.
[01:15:05] The cleanup one is not obsolete though :)
[01:15:07] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 1 failures
[01:15:18] I didn't mean old as in obsolete, just "in existence previously"
[01:19:08] greg-g, I'd still like to do the Catalan Wikipedia thing if possible. Could I do another window at 5:30 Pacific for that?
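The purgeModuleDeps.php script synced above addresses the "PHP Warning: filemtime(): No such file or directory" noise (T158105, run below): ResourceLoader's module_deps table keeps per-module lists of file dependencies, and entries for files removed long ago need to be purged. A purge pass boils down to dropping the paths that no longer exist, sketched here in Python (the mapping shape mimics module_deps rows; this is an illustration, not the actual PHP implementation):

```python
import os

def purge_missing_deps(module_deps):
    """Drop file paths that no longer exist from each module's dependency
    list; return (cleaned mapping, number of paths removed).

    `module_deps` mimics rows of MediaWiki's module_deps table as
    {module name: [file paths]}. Illustrative sketch only.
    """
    cleaned, removed = {}, 0
    for module, paths in module_deps.items():
        keep = [p for p in paths if os.path.exists(p)]
        removed += len(paths) - len(keep)
        cleaned[module] = keep
    return cleaned, removed
```

With stale paths gone, later `filemtime()` calls on recorded dependencies no longer hit missing files.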
[01:20:07] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 1 failures
[01:25:07] RECOVERY - check_puppetrun on bellatrix is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[01:25:17] RECOVERY - puppet last run on ms-be1022 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[01:31:29] matt_flaschen: yeah, no worries.
[01:33:49] !log terbium$ mwscript purgeModuleDeps.php --wiki test2wiki (T158105)
[01:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:55] T158105: "PHP Warning: filemtime(): No such file or directory" about files removed over a year ago - https://phabricator.wikimedia.org/T158105
[01:34:07] !log terbium$ foreachwikiindblist group0 purgeModuleDeps.php (T158105)
[01:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:37] !log terbium$ foreachwiki purgeModuleDeps.php (T158105)
[01:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:48] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 1802.260436 Seconds
[01:35:33] Krinkle, is that "yeah" about "in existence previously", or about adding another window? Not sure if greg went home, he is not marked away.
[01:35:47] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 17.527934 Seconds
[01:36:01] matt_flaschen: existence previously
[01:56:37] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:10:37] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:12:44] (03CR) 10Ricordisamoa: Gerrit: Fix bot so that it checks against *-name and *-username (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/340801 (https://phabricator.wikimedia.org/T159075) (owner: 10Paladox)
[02:25:37] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[02:33:21] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.14) (duration: 13m 28s)
[02:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:38:37] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[02:38:40] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Mar 3 02:38:40 UTC 2017 (duration 5m 19s)
[02:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:52:17] PROBLEM - puppet last run on mw1198 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:04:03] (03PS4) 10Dzahn: Gerrit: Fix bot so that it checks against *-name and *-username [puppet] - 10https://gerrit.wikimedia.org/r/340801 (https://phabricator.wikimedia.org/T76291) (owner: 10Paladox)
[04:05:08] (03PS5) 10Dzahn: Gerrit: Fix bot so that it checks against *-name and *-username [puppet] - 10https://gerrit.wikimedia.org/r/340801 (https://phabricator.wikimedia.org/T76291) (owner: 10Paladox)
[04:05:57] (03CR) 10Dzahn: "@Ricordisamoa yes, it is, thanks. fixed" [puppet] - 10https://gerrit.wikimedia.org/r/340801 (https://phabricator.wikimedia.org/T76291) (owner: 10Paladox)
[04:06:32] (03CR) 10Dzahn: [C: 032] Gerrit: Fix bot so that it checks against *-name and *-username [puppet] - 10https://gerrit.wikimedia.org/r/340801 (https://phabricator.wikimedia.org/T76291) (owner: 10Paladox)
[04:07:13] (03PS5) 10Dzahn: Gerrit: Fix bot so it reports the correct user merging the change [puppet] - 10https://gerrit.wikimedia.org/r/340735 (https://phabricator.wikimedia.org/T159441) (owner: 10Paladox)
[04:07:36] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#2990133 (10tstarling) @Legoktm do you think this could work like https://gerrit.wikimedia.org/r/...
[04:08:13] (03CR) 10Dzahn: [C: 032] Gerrit: Fix bot so it reports the correct user merging the change [puppet] - 10https://gerrit.wikimedia.org/r/340735 (https://phabricator.wikimedia.org/T159441) (owner: 10Paladox)
[04:12:00] (03PS3) 10Dzahn: mgmt: script to detect vendor by mgmt ssh banner [puppet] - 10https://gerrit.wikimedia.org/r/340450 (https://phabricator.wikimedia.org/T156673)
[04:12:17] PROBLEM - puppet last run on darmstadtium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:13:59] !log legoktm@tin Synchronized php-1.29.0-wmf.14/maintenance/refreshLinks.php: Queue non-recursive updates - https://gerrit.wikimedia.org/r/340920 (duration: 00m 40s)
[04:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:16:10] !log running refreshLinks.php on aawiki
[04:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:18:31] (03PS3) 10Dzahn: mgmt: script to change mgmt password on HP servers [puppet] - 10https://gerrit.wikimedia.org/r/340567 (owner: 10Papaul)
[04:19:57] (03CR) 10Dzahn: [C: 032] mgmt: script to change mgmt password on HP servers [puppet] - 10https://gerrit.wikimedia.org/r/340567 (owner: 10Papaul)
[04:20:12] (03PS4) 10Dzahn: mgmt: script to change mgmt password on HP servers [puppet] - 10https://gerrit.wikimedia.org/r/340567 (owner: 10Papaul)
[04:20:14] (03PS2) 10Dzahn: DHCP: Add DHCP entries for ms-be2028-msbe2039 [puppet] - 10https://gerrit.wikimedia.org/r/340896 (owner: 10Papaul)
[04:21:17] RECOVERY - puppet last run on mw1198 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[04:23:44] (03CR) 10Dzahn: [C: 032] "yep, those are HP MACs which matches ticket info" [puppet] - 10https://gerrit.wikimedia.org/r/340896 (owner: 10Papaul)
[04:23:53] (03PS3) 10Dzahn: DHCP: Add DHCP entries for ms-be2028-msbe2039 [puppet] - 10https://gerrit.wikimedia.org/r/340896 (owner: 10Papaul)
[04:27:06] (03CR) 10Dzahn: "we can test the rewrite rules with apache-fast-test. what it needs is a good list of URLs to test" [puppet] - 10https://gerrit.wikimedia.org/r/340900 (https://phabricator.wikimedia.org/T156120) (owner: 10Paladox)
[04:30:20] (03CR) 10Dzahn: "# [*strict*]" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox)
[04:35:54] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3069980 (10Legoktm) Yes, I'd rather implement this with an EtcdConfig class or something instead...
[04:38:43] !log planet2001 - reinstall, boot into installer, scheduled downtime (T15943)
[04:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:38:49] T15943: Image redirect implementation is very confusing - https://phabricator.wikimedia.org/T15943
[04:39:13] !log planet2001 last log message was for T159432
[04:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:39:18] T159432: Inconsistent package status on planet2001 - https://phabricator.wikimedia.org/T159432
[04:39:49] 06Operations: Inconsistent package status on planet2001 - https://phabricator.wikimedia.org/T159432#3069992 (10Dzahn) 20:41 < mutante> !log planet2001 - reinstall, boot into installer, scheduled downtime
[04:40:17] RECOVERY - puppet last run on darmstadtium is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[05:17:03] 06Operations: Inconsistent package status on planet2001 - https://phabricator.wikimedia.org/T159432#3070006 (10Dzahn) instance reinstalled but can't reach puppetmaster, gotta debug what's going on. currently gone from icinga and reachable with install-console after fresh install but no initial puppet run.
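The `foreachwiki` and `foreachwikiindblist` runs logged earlier (for purgeModuleDeps.php at 01:34) iterate over a wiki database list and invoke the maintenance script once per wiki via `mwscript script.php --wiki=<dbname>`. The iteration can be sketched like this, with the command construction separated out so it is testable without shelling out (the dblist parsing and helper names are illustrative, not the actual wrapper scripts):

```python
def read_dblist(text):
    """Parse a dblist file: one dbname per line, blank lines and
    '#' comment lines ignored. Illustrative sketch of the format."""
    return [line.strip() for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]

def commands_for_dblist(dbnames, script):
    """Build one `mwscript` invocation per wiki, mirroring what the
    foreachwiki-style wrappers do. Illustrative only."""
    return [["mwscript", script, "--wiki=" + db] for db in dbnames]
```

For example, `foreachwikiindblist group0 purgeModuleDeps.php` corresponds to building the command list from the `group0` dblist and running each command sequentially.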
[05:17:19] 06Operations: Inconsistent package status on planet2001 - https://phabricator.wikimedia.org/T159432#3070007 (10Dzahn) p:05Triage>03Normal
[05:32:42] (03CR) 10Ladsgroup: [C: 031] "LGTM but once it's merged I suggest we run some tests in codfw to see if it works as expected." [puppet] - 10https://gerrit.wikimedia.org/r/340485 (https://phabricator.wikimedia.org/T139372) (owner: 10Filippo Giunchedi)
[05:43:10] hi
[06:27:20] PROBLEM - puppet last run on chromium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:47:32] (03PS1) 10Marostegui: db-codfw.php: Depool db2067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340929 (https://phabricator.wikimedia.org/T159414)
[06:51:55] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3070042 (10Joe) >>! In T156924#3069980, @Legoktm wrote: > Yes, I'd rather implement this with an...
[06:55:20] RECOVERY - puppet last run on chromium is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[07:30:29] !log installing w3m security updates on trusty (jessie already fixed)
[07:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:50] PROBLEM - puppet last run on mw1173 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:05:56] (03CR) 10Giuseppe Lavagetto: pybal::configuration: explicitly set the conftool prefix (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/340727 (owner: 10Giuseppe Lavagetto)
[08:07:00] (03CR) 10Giuseppe Lavagetto: [C: 031] hieradata: add oresrdb in codfw [puppet] - 10https://gerrit.wikimedia.org/r/340485 (https://phabricator.wikimedia.org/T139372) (owner: 10Filippo Giunchedi)
[08:10:30] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340929 (https://phabricator.wikimedia.org/T159414) (owner: 10Marostegui)
[08:11:14] (03PS1) 10Giuseppe Lavagetto: role::mediawiki::webserver: add access log to tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/340931
[08:11:51] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340929 (https://phabricator.wikimedia.org/T159414) (owner: 10Marostegui)
[08:11:58] (03CR) 10jenkins-bot: db-codfw.php: Depool db2067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340929 (https://phabricator.wikimedia.org/T159414) (owner: 10Marostegui)
[08:13:39] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2067 - T159414 (duration: 00m 40s)
[08:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:43] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414
[08:14:50] RECOVERY - puppet last run on mw1173 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[08:20:18] !log Deploy alter table s6 on db2067 - T159414
[08:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:23] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414
[08:21:24] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, 07Upstream: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739#3070123 (10MoritzMuehlenhoff) I've filed ht...
[08:21:35] Reedy: got a minute to cleanupTitles.php --dry-run for enwiki? - T159515 ?
[08:21:36] T159515: Run namespaceDupes.php/cleanupTitles.php on the English Wikipedia to fix horizon: Interwiki link conflict - https://phabricator.wikimedia.org/T159515
[08:22:25] !log Run pt-table-checksum on s2 (nowiki) - T154485
[08:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:32] T154485: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485
[08:27:38] !log upgrading apache on bromine
[08:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:33] (03PS1) 10Giuseppe Lavagetto: profile::discovery::client: expose services as well [puppet] - 10https://gerrit.wikimedia.org/r/340935 (https://phabricator.wikimedia.org/T149617)
[08:48:31] (03PS2) 10Giuseppe Lavagetto: pybal::configuration: explicitly set the conftool prefix [puppet] - 10https://gerrit.wikimedia.org/r/340727
[08:50:25] I am going to reboot CI Jenkins soonish. For a few minutes
[08:56:08] (03CR) 10Giuseppe Lavagetto: [C: 032] pybal::configuration: explicitly set the conftool prefix [puppet] - 10https://gerrit.wikimedia.org/r/340727 (owner: 10Giuseppe Lavagetto)
[09:03:35] !log Restarting Jenkins
[09:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:54] (03PS2) 10Giuseppe Lavagetto: role::mediawiki::webserver: add access log to tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/340931
[09:06:10] PROBLEM - puppet last run on mw1176 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:09:10] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100%
[09:09:30] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms
[09:17:03] mmmmm
[09:18:26] is it one of the new servers/
[09:18:27] ?
[09:18:47] ah yes
[09:19:10] nothing on them at the moment
[09:22:15] <_joe_> elukey: check for conflicting IPs
[09:22:24] <_joe_> that's a classical reason for such flapping
[09:23:40] (03PS1) 10Gehel: wdqs: cleanup old GC logs [puppet] - 10https://gerrit.wikimedia.org/r/340940 (https://phabricator.wikimedia.org/T159248)
[09:24:31] (03CR) 10Gehel: "This change makes sense in conjunction with https://gerrit.wikimedia.org/r/#/c/340938/, but can be deployed independently." [puppet] - 10https://gerrit.wikimedia.org/r/340940 (https://phabricator.wikimedia.org/T159248) (owner: 10Gehel)
[09:27:01] hashar: thanks a lot for the "bundle exec rake spec" tip
[09:27:20] elukey: I am up for pairing anytime :]
[09:27:40] I just made it work on macos, looks good
[09:27:44] the stack is a bit intimidating, but with some doc and practice it is eventually a breeze :)
[09:27:56] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3070174 (10jcrespo)
[09:28:01] it says that aptrepo is not passing
[09:28:03] !log Restarting Jenkins (2)
[09:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:14] yeah there are some glitches still. I might have patches pending in gerrit
[09:28:25] also some spec fail on Mac OS X
[09:28:27] anyhow, really nice :)
[09:28:35] thanks a lot for all the work
[09:28:42] team effort! :D
[09:29:03] specially Alexandros and Giuseppe
[09:29:14] <_joe_> what did I do?
[09:29:15] and Gehel kitchen kabal
[09:29:15] <_joe_> :P
[09:29:20] you wrote some spec :]
[09:29:31] <_joe_> I only write specs for functions and resources
[09:29:52] <_joe_> I find specs of puppet manifests to be as useful as a used chewing gum
[09:30:20] elukey: on Linux aptrepo spec pass. Using: bundle exec rake spec:aptrepo
[09:30:20] What error do you have?
[09:31:16] 06Operations, 07HHVM: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3070176 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff 3.18.1 packages are now built. Next I'll rebuild 3.18.1 in Debian unstable to narrow down whether the hphp/test/quick/json_bigint.ph test failure is related to t...
[09:33:37] hashar: could you do me a li'l favor - T159515 ?
[09:33:37] T159515: Run namespaceDupes.php/cleanupTitles.php on the English Wikipedia to fix horizon: Interwiki link conflict - https://phabricator.wikimedia.org/T159515
[09:35:08] elukey: try that spec fix https://gerrit.wikimedia.org/r/#/c/331632/ :)
[09:35:10] RECOVERY - puppet last run on mw1176 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[09:35:49] * gehel disagrees with Joe on the usefulness of manifest specs (but don't think he is going to convince Joe...)
[09:39:45] <_joe_> gehel: oh I'm open to being convinced otherwise, possibly with practical examples of usefulness
[09:40:25] depends on your skill level of puppet and brain ability to grasp all the things that can possibly go wrong
[09:40:39] I have been using spec to migrate Zuul craps to use hiera
[09:40:40] <_joe_> I don't think that's the problem
[09:41:10] <_joe_> the problem is that testing means basically repeating verbatim what you wrote in the manifest, in another form
[09:41:16] let me easily compile erb templates and check they are valid and of course catch all the lame typos / duplicate resources etc that have always been a pain for me
[09:41:26] I don't see unit test as a way to test correctness, but as a way to test design, as such, testing manifests makes sense to me...
[09:41:37] <_joe_> test design?
[09:41:43] <_joe_> what do you mean?
[09:42:22] if you can write a trivial spec on a manifest, you prove that it is modular enough to be tested, so good design
[09:43:18] <_joe_> I don't think that it gives you much more than "this manifest has no external dependencies"
[09:43:52] <_joe_> and honestly I use tests (unit/integration/behaviour) to ensure the software behaves the way I intend it to behave
[09:44:03] or this manifest has an explicit set of dependencies and I understand how they work
[09:44:19] * gehel agrees with integration/behaviour, not with unit
[09:44:28] one I stumbled upon is interface module failing with ruby2.4. There is some inline_template() call with a bunch of ruby that is hard to execute: https://gerrit.wikimedia.org/r/#/c/336840/
[09:45:20] <_joe_> then, look: if we made some effort to move our code to the role/profile pattern it would do our code much more good than any other predicament about modularity
[09:45:26] <_joe_> I can guarantee you this :)
[09:45:36] <_joe_> so spend your energy on that :P
[09:45:46] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3070184 (10jcrespo)
[09:46:55] (03CR) 10Giuseppe Lavagetto: [C: 032] role::mediawiki::webserver: add access log to tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/340931 (owner: 10Giuseppe Lavagetto)
[09:48:33] _joe_: agreed!
[09:49:19] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3070202 (10jcrespo)
[09:50:01] <_joe_> ouch
[09:50:05] <_joe_> I did something wrong
[09:52:07] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3070210 (10jcrespo)
[09:52:12] <_joe_> interesting
[09:52:28] <_joe_> systemctl reload nginx seems to be broken on appservers, uhm
[09:54:11] 06Operations, 10ops-codfw, 15User-Elukey: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#3070211 (10elukey) Adding a note to check for re-occurrences: ``` 09:09 PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100% 09:09 RECOVERY - Host mw22...
[09:58:45] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3070219 (10jcrespo)
[10:02:43] off for a few hours. Be back at 2pm
[10:06:11] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3070267 (10jcrespo)
[10:09:22] (03PS1) 10Filippo Giunchedi: install_server: fix graphite.cfg partition ordering [puppet] - 10https://gerrit.wikimedia.org/r/340946
[10:12:17] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3070271 (10Tpt)
[10:16:24] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3070277 (10jcrespo)
[10:18:03] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#1118490 (10jcrespo)
[10:19:05] (03PS1) 10Gehel: automate upload of elasticsearch plugins to archiva [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340947
[10:19:20] PROBLEM - puppet last run on restbase1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:31:00] PROBLEM - HP RAID on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
[10:33:23] !log joal@tin Started deploy [analytics/refinery@1440646]: (no justification provided)
[10:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:50] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=81%)
[10:40:12] <_joe_> uh? what's up on stat1002?
[10:40:26] <_joe_> joal: may have something to do with your deploy?
[10:40:45] hey _joe_: it does !!!!
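pt-table-checksum, started above on s2, verifies that replicas match the master by checksumming tables in fixed chunks on the master; replication replays the same checksum queries on each replica, and chunks whose checksums differ point at drifted rows. The comparison step can be sketched as follows (a pure-Python stand-in for the per-chunk checksum aggregation the tool pushes into SQL, not the tool itself):

```python
import hashlib

def chunk_checksums(rows, chunk_size):
    """Checksum a table's rows in fixed-size chunks. Stand-in for the
    CRC/aggregate checksums pt-table-checksum computes in SQL."""
    sums = []
    for i in range(0, len(rows), chunk_size):
        chunk = rows[i:i + chunk_size]
        sums.append(hashlib.md5(repr(chunk).encode()).hexdigest())
    return sums

def diff_chunks(master_rows, replica_rows, chunk_size=2):
    """Return indices of chunks whose checksums differ between hosts;
    an empty list means the replica is consistent with the master."""
    master = chunk_checksums(master_rows, chunk_size)
    replica = chunk_checksums(replica_rows, chunk_size)
    return [i for i, (a, b) in enumerate(zip(master, replica)) if a != b]
```

Running this kind of check before decommissioning the old s2 masters (the point of T154485) narrows any drift down to specific chunks instead of requiring a full row-by-row comparison.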
[10:41:08] _joe_: stat1002 has a small root partition, therefore deploying sometimes causes this issue
[10:41:27] _joe_: elukey is kinda used to me getting in trouble with this, he'll be back soon
[10:41:27] <_joe_> joal: uhm, 35G is not small by my definition
[10:41:29] <_joe_> but still
[10:41:38] <_joe_> ok :)
[10:42:13] PROBLEM - MariaDB Slave IO: s1 on db1051 is CRITICAL: CRITICAL slave_io_state could not connect
[10:42:18] _joe_: You're absolutely right - but unfortunately scap redeploys fat-jars every time, so it finally piles up fast
[10:42:23] PROBLEM - MariaDB Slave SQL: s1 on db1051 is CRITICAL: CRITICAL slave_sql_state could not connect
[10:42:33] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1056 - https://phabricator.wikimedia.org/T159410#3070295 (10Volans)
[10:42:49] did db1051 go down?
[10:43:14] jynus: according to tendril restarted 2m ago
[10:43:30] it had an OOM
[10:43:41] ha
[10:44:30] PROBLEM - puppet last run on mw1299 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:45:57] joal: here I am!
[10:46:03] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 13Patch-For-Review, and 6 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#3070301 (10Joe) So, I just found out that the dns cache feature we were supposedly using in HHVM h...
[10:46:09] thanks mate :(
[10:46:31] sorry _joe_, the refinery uses git-fat and sometimes we need to remove some scap revs
[10:46:50] elukey: I'm sorry I didn't check before deploying (I should have :/)
[10:47:07] nah it's fine! Blame the analytics ops absent from work!
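The stat1002 outage above is old scap deploy revisions (each carrying full git-fat jars) piling up on a small root partition; "removing some scap revs" amounts to deleting all but the newest few revision directories while never touching the one the `current` symlink points at. A sketch of that cleanup, under an assumed `revs/` + sibling `current` symlink layout (this is not scap's actual cleanup code):

```python
import os
import shutil

def prune_revs(revs_dir, keep=3):
    """Delete all but the `keep` most recently modified revision
    directories under revs_dir, skipping the target of the `current`
    symlink one level up. Layout is an assumption, not scap's own.
    Returns the list of directories removed."""
    current = os.path.realpath(os.path.join(revs_dir, os.pardir, "current"))
    revs = [os.path.join(revs_dir, d) for d in os.listdir(revs_dir)]
    revs = [d for d in revs if os.path.isdir(d)]
    revs.sort(key=os.path.getmtime, reverse=True)  # newest first
    removed = []
    for d in revs[keep:]:
        if os.path.realpath(d) != current:  # never delete the live rev
            shutil.rmtree(d)
            removed.append(d)
    return removed
```

Sorting by mtime and protecting the `current` target is the whole trick; everything older than the retained window is reclaimable disk.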
[10:47:12] :)
[10:47:14] fixing stat
[10:47:15] huhu :)
[10:47:18] thanks
[10:47:19] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1056 - https://phabricator.wikimedia.org/T159410#3070302 (10Marostegui) Let's get it replaced when @Cmjohnson has some time
[10:47:20] RECOVERY - puppet last run on restbase1009 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[10:48:50] RECOVERY - Disk space on stat1002 is OK: DISK OK
[10:48:57] !log joal@tin Finished deploy [analytics/refinery@1440646]: (no justification provided) (duration: 15m 33s)
[10:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:12] volans, marostegui let me stop it anyway for upgrade
[10:51:00] RECOVERY - HP RAID on dbstore2001 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Battery/Capacitor
[10:53:38] !log Start pt-table-checksum on plwiki (s2) - T154485
[10:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:44] T154485: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485
[11:02:07] !log joal@tin Started deploy [analytics/refinery@1440646]: (no justification provided)
[11:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:30] !log joal@tin Finished deploy [analytics/refinery@1440646]: (no justification provided) (duration: 01m 23s)
[11:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:16] (03PS2) 10Gehel: automate upload of elasticsearch plugins to archiva [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340947
[11:05:14] !log stopping mariadb and restarting db1051 for maintenance
[11:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:11] !log elukey@tin Started deploy [analytics/refinery@1440646]: (no justification provided)
[11:09:13] !log elukey@tin Finished deploy [analytics/refinery@1440646]: (no justification provided) (duration: 00m 02s)
[11:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:39] !log elukey@tin Started deploy [analytics/refinery@1440646]: (no justification provided)
[11:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:53] !log elukey@tin Finished deploy [analytics/refinery@1440646]: (no justification provided) (duration: 00m 14s)
[11:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:12] I am fixing stat1002, sorry for the spam
[11:12:30] RECOVERY - puppet last run on mw1299 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[11:24:15] (03PS2) 10Giuseppe Lavagetto: profile::discovery::client: expose services as well [puppet] - 10https://gerrit.wikimedia.org/r/340935 (https://phabricator.wikimedia.org/T149617)
[11:26:45] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3070352 (10Joe) An example of an output file is: https://config-master.wikimedia.org/discovery/...
[11:27:22] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::discovery::client: expose services as well [puppet] - 10https://gerrit.wikimedia.org/r/340935 (https://phabricator.wikimedia.org/T149617) (owner: 10Giuseppe Lavagetto)
[11:31:33] (03PS2) 10Filippo Giunchedi: install_server: fix graphite.cfg partition ordering [puppet] - 10https://gerrit.wikimedia.org/r/340946
[11:33:50] PROBLEM - puppet last run on rcs1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:33:56] (03CR) 10Filippo Giunchedi: [C: 032] install_server: fix graphite.cfg partition ordering [puppet] - 10https://gerrit.wikimedia.org/r/340946 (owner: 10Filippo Giunchedi)
[11:34:56] _joe_: merging yours as well
[11:35:12] also there's a trailing whitespace there, tut tut
[11:35:13] <_joe_> godog: ouch I did forget, sorry
[11:35:45] <_joe_> godog: ouch yeah I know why too, meh
[11:35:49] <_joe_> amending
[11:37:47] (03PS1) 10Filippo Giunchedi: site: remove alerts from graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/340954
[11:37:50] (03PS1) 10Giuseppe Lavagetto: profile::discovery::client: fix template, whitespace [puppet] - 10https://gerrit.wikimedia.org/r/340955
[11:39:28] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::discovery::client: fix template, whitespace [puppet] - 10https://gerrit.wikimedia.org/r/340955 (owner: 10Giuseppe Lavagetto)
[11:40:19] ACKNOWLEDGEMENT - HP RAID on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. Marostegui normally happens when there is high load
[11:40:35] (03CR) 10Fdans: "@Krinkle, you're totally right, my apologies for that. I'm still getting the hang of working with gerrit and I'm guessing I didn't pull yo" [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria)
[11:44:46] (03PS1) 10Giuseppe Lavagetto: profile::discovery::client: change key prefix [puppet] - 10https://gerrit.wikimedia.org/r/340957
[11:47:40] PROBLEM - puppet last run on ms-fe1002 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [11:48:05] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::discovery::client: change key prefix [puppet] - 10https://gerrit.wikimedia.org/r/340957 (owner: 10Giuseppe Lavagetto) [11:50:25] 06Operations, 10netops: asw-a1-codfw spontaneous reboot - https://phabricator.wikimedia.org/T159464#3070397 (10elukey) No results for `show system core-dumps` too. [11:51:05] 06Operations: Package the next LTS kernel (4.9) - https://phabricator.wikimedia.org/T154934#3070398 (10MoritzMuehlenhoff) Debian jessie-backports will follow the kernel from Debian stretch, i.e. 4.9.x. This means that we follow Debian more closely and don't need an internal build (at least until we migrate to th... [11:53:59] (03PS2) 10Filippo Giunchedi: site: remove alerts from graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/340954 [11:56:05] (03CR) 10Filippo Giunchedi: [C: 032] site: remove alerts from graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/340954 (owner: 10Filippo Giunchedi) [11:57:12] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Shreyas Lakhtakia (shrlak) - https://phabricator.wikimedia.org/T158978#3070404 (10elukey) [12:01:50] RECOVERY - puppet last run on rcs1001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [12:08:08] 06Operations: backup space is used unwisely - https://phabricator.wikimedia.org/T159524#3070416 (10jcrespo) [12:15:40] RECOVERY - puppet last run on ms-fe1002 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [12:29:20] (03PS1) 10Muehlenhoff: Add Jonas Kress to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/340966 [12:29:48] 06Operations, 10Ops-Access-Requests: Requesting access to wikimedia-operations-channel-op for Luke081515 - https://phabricator.wikimedia.org/T159473#3070465 (10ema) p:05Triage>03Normal [12:33:08] (03PS2) 
10Muehlenhoff: Add Jonas Kress to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/340966 [12:43:19] (03PS3) 10Gehel: automate upload of elasticsearch plugins to archiva [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340947 [12:46:42] (03CR) 10Muehlenhoff: [C: 032] Add Jonas Kress to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/340966 (owner: 10Muehlenhoff) [12:51:20] (03PS4) 10Gehel: automate upload of elasticsearch plugins to archiva [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340947 [12:54:30] PROBLEM - puppet last run on mw1200 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:03:00] (03CR) 10DCausse: automate upload of elasticsearch plugins to archiva (031 comment) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340947 (owner: 10Gehel) [13:04:14] (03PS5) 10Gehel: automate upload of elasticsearch plugins to archiva [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340947 [13:04:18] (03CR) 10Gehel: automate upload of elasticsearch plugins to archiva (031 comment) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340947 (owner: 10Gehel) [13:08:26] (03Abandoned) 10Hashar: jenkins: move plugins cache from /var/run to /var/cache [puppet] - 10https://gerrit.wikimedia.org/r/340576 (owner: 10Hashar) [13:08:28] (03Abandoned) 10Hashar: jenkins: expand war in /var/cache instead of /var/run [puppet] - 10https://gerrit.wikimedia.org/r/340580 (owner: 10Hashar) [13:11:09] (03PS18) 10Hashar: jenkins: migrate to systemd [puppet] - 10https://gerrit.wikimedia.org/r/337404 [13:12:08] !log removed apache2 (rc state) and apache2-utils from analytics1027 [13:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:59] (03PS19) 10Hashar: jenkins: migrate to systemd [puppet] - 10https://gerrit.wikimedia.org/r/337404 [13:22:30] RECOVERY - puppet last run on mw1200 is OK: OK: Puppet is 
currently enabled, last run 7 seconds ago with 0 failures [13:25:42] (03PS1) 10DCausse: Add a bash script to fetch and update this repo [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340973 [13:29:15] 06Operations, 06Analytics-Kanban, 10Traffic, 06Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3070565 (10elukey) Summary before closing: Piwik was showing a lot of errors in the apache logs, we removed them and opened a task to a... [13:31:24] 06Operations, 07HHVM: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3070573 (10MoritzMuehlenhoff) When building on Debian stretch with json-c 0.12.1 the same error occurs (and also an additional test failure), will open a bug for that. [13:31:31] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340974 [13:34:46] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340974 (owner: 10Marostegui) [13:36:28] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340974 (owner: 10Marostegui) [13:37:37] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340974 (owner: 10Marostegui) [13:37:41] (03CR) 10Gehel: Add a bash script to fetch and update this repo (031 comment) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340973 (owner: 10DCausse) [13:37:57] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2067 - T159414 (duration: 00m 50s) [13:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:03] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414 [13:42:05] 06Operations, 07HHVM: Build / migrate to HHVM 
3.18 - https://phabricator.wikimedia.org/T158176#3070602 (10MoritzMuehlenhoff) Reported upstream at https://github.com/facebook/hhvm/issues/7708 [13:55:12] (03PS1) 10DCausse: Upgrade to elastic 5.2.2 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340977 [14:14:28] 06Operations, 10Analytics, 10Analytics-Cluster: Reimage a Trusty Hadoop worker to Debian jessie - https://phabricator.wikimedia.org/T159530#3070663 (10elukey) [14:18:21] (03PS1) 10Rush: nova: fullstack test increase allowed time to SSH [puppet] - 10https://gerrit.wikimedia.org/r/340979 [14:20:27] (03PS1) 10Elukey: Allow analytics1040 to be reimaged with Debian Jessie [puppet] - 10https://gerrit.wikimedia.org/r/340980 (https://phabricator.wikimedia.org/T159530) [14:20:44] jouncebot: next [14:20:44] In 71 hour(s) and 39 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170306T1400) [14:20:51] jouncebot: now [14:20:52] No deployments scheduled for the next 71 hour(s) and 39 minute(s) [14:28:25] (03CR) 10Rush: [C: 032] nova: fullstack test increase allowed time to SSH [puppet] - 10https://gerrit.wikimedia.org/r/340979 (owner: 10Rush) [14:31:08] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review: Reimage a Trusty Hadoop worker to Debian jessie - https://phabricator.wikimedia.org/T159530#3070691 (10elukey) [14:32:42] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review: Reimage a Trusty Hadoop worker to Debian jessie - https://phabricator.wikimedia.org/T159530#3070663 (10elukey) Just checked the labs instance (analytics project) cdh3-5.eqiad.wmlabs and everything seems working fine (no systemctl de... [14:32:51] for deployment-prep why are we deleting known-hosts? is this some sort of security thing? 
[14:42:41] 06Operations, 06Labs: openstack instance creation sometimes takes >480s - https://phabricator.wikimedia.org/T159459#3070723 (10chasemp) So, this seems like partially SSH timeouts. I have no problem upping that for now while we are still figuring out baselines. The puppet run and setup variance is the most under... [14:48:53] PROBLEM - DPKG on db1045 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:49:33] that is me [14:49:50] I was cleaning up the space, / had a warning [14:49:53] RECOVERY - DPKG on db1045 is OK: All packages OK [14:49:53] PROBLEM - puppet last run on prometheus2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:52:14] 06Operations, 06Labs: openstack instance creation sometimes takes >480s - https://phabricator.wikimedia.org/T159459#3070728 (10chasemp) So I just caught an instance that had initial issues. > 2017-03-03 14:38:46,717 INFO Creating fullstackd-1488551924 > 2017-03-03 14:44:32,354 INFO servers.labnet1001.nova.ve... [14:52:36] 06Operations, 15User-Elukey: labtestcontrol2001: cron-spam from invoke-rc.d atop _cron - https://phabricator.wikimedia.org/T159532#3070731 (10ema) [14:53:37] 06Operations, 06Labs, 15User-Elukey: labtestcontrol2001: cron-spam from invoke-rc.d atop _cron - https://phabricator.wikimedia.org/T159532#3070758 (10ema) [14:56:45] (03PS6) 10Gehel: automate management of elasticsearch plugin repository [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340947 [14:57:33] PROBLEM - puppet last run on chromium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:02:34] 06Operations, 06Labs: openstack instance creation sometimes takes >480s - https://phabricator.wikimedia.org/T159459#3070764 (10chasemp) load across active labvirts ```labvirt1001 15:01:30 up 128 days, 20:30, 1 user, load average: 52.25, 47.77, 48.21 labvirt1004 15:01:32 up 121 days, 19:37, 0 users, load... 
[15:03:35] 06Operations, 06Labs: openstack instance creation sometimes takes >480s - https://phabricator.wikimedia.org/T159459#3070765 (10chasemp) labvirt1001 is handling more than its share of load here, and I'm wondering if the scheduler is fairly weighted across these nodes that are at this moment unfairly allocated.... [15:06:14] (03PS1) 10Rush: nova: remove labvirt100[12] from scheduler pool to rebalance [puppet] - 10https://gerrit.wikimedia.org/r/340986 [15:11:17] 06Operations: backup space is used unwisely - https://phabricator.wikimedia.org/T159524#3070416 (10Andrew) There's definitely no need to backup labtestweb. Silver is important to back up since it contains our technical documentation... we have an offsite backup of it at https://wikitech-static.wikimedia.org/wik... [15:12:08] (03CR) 10DCausse: automate management of elasticsearch plugin repository (034 comments) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340947 (owner: 10Gehel) [15:12:47] (03PS1) 10Jcrespo: Starting to refactor mariadb config template system [puppet] - 10https://gerrit.wikimedia.org/r/340987 [15:14:12] (03PS2) 10Jcrespo: Starting to refactor mariadb config template system [puppet] - 10https://gerrit.wikimedia.org/r/340987 [15:14:34] 06Operations, 06Labs: openstack instance creation sometimes takes >480s - https://phabricator.wikimedia.org/T159459#3070787 (10chasemp) >>! In T159459#3070765, @chasemp wrote: > labvirt1001 is handling more than its share of load here, and I'm wondering if the scheduler is fairly weighted across these nodes th... 
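The per-host `uptime` lines chasemp pastes on the task can be reduced to a 1-minute load figure mechanically. A minimal sketch, assuming the `host ... load average: 1m, 5m, 15m` line shape shown above (the helper names are hypothetical):

```python
import re


def one_minute_load(uptime_line):
    """Parse 'host ... load average: 1m, 5m, 15m' into (host, 1-minute load)."""
    host = uptime_line.split()[0]
    match = re.search(r"load average:\s*([\d.]+)", uptime_line)
    return host, float(match.group(1))


def most_loaded(uptime_lines):
    """Return the (host, load) pair with the highest 1-minute load."""
    return max((one_minute_load(line) for line in uptime_lines),
               key=lambda pair: pair[1])
```

Something like this makes it easy to eyeball whether the nova scheduler is spreading instances fairly, which is exactly the question being asked before labvirt100[12] were pulled from the pool.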
[15:15:43] (03CR) 10Andrew Bogott: [C: 031] "As long as we don't forget we did this" [puppet] - 10https://gerrit.wikimedia.org/r/340986 (owner: 10Rush) [15:16:23] (03CR) 10Rush: [C: 032] nova: remove labvirt100[12] from scheduler pool to rebalance [puppet] - 10https://gerrit.wikimedia.org/r/340986 (owner: 10Rush) [15:16:35] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#3070789 (10Cmjohnson) The disk has been swapped. Rebuilding Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: R... [15:16:55] 06Operations, 10ops-eqiad, 13Patch-For-Review: Degraded RAID on ms-be1012 - https://phabricator.wikimedia.org/T157237#3070790 (10Cmjohnson) The ssd has been swapped...will need to be added back to raid cfg [15:17:53] RECOVERY - puppet last run on prometheus2002 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:20:56] (03PS3) 10Jcrespo: Starting to refactor mariadb config template system [puppet] - 10https://gerrit.wikimedia.org/r/340987 [15:25:33] RECOVERY - puppet last run on chromium is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [15:26:07] (03PS4) 10Jcrespo: Starting to refactor mariadb config template system [puppet] - 10https://gerrit.wikimedia.org/r/340987 [15:27:52] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#3070820 (10Marostegui) Thanks! I will keep an eye and close this ticket once it is all good! 
[15:29:51] (03PS7) 10Gehel: automate management of elasticsearch plugin repository [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340947 [15:29:56] (03CR) 10Gehel: automate management of elasticsearch plugin repository (034 comments) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340947 (owner: 10Gehel) [15:32:11] 06Operations: Puppet constantly trying to stop the already stopped puppetmaster process on Trusty - https://phabricator.wikimedia.org/T159536#3070822 (10Andrew) [15:42:24] (03PS1) 10Giuseppe Lavagetto: discovery: add parsoid entry [puppet] - 10https://gerrit.wikimedia.org/r/340992 [15:42:26] (03PS1) 10Giuseppe Lavagetto: realm: remove parsoid_site, switch to discovery. [puppet] - 10https://gerrit.wikimedia.org/r/340993 [15:42:28] (03PS1) 10Giuseppe Lavagetto: discovery: add more DNS entries [puppet] - 10https://gerrit.wikimedia.org/r/340994 [15:42:30] (03PS1) 10Giuseppe Lavagetto: realm: remove graphoid_site [puppet] - 10https://gerrit.wikimedia.org/r/340995 [15:42:32] (03PS1) 10Giuseppe Lavagetto: realm: get rid of more entries [puppet] - 10https://gerrit.wikimedia.org/r/340996 [15:42:34] (03PS1) 10Giuseppe Lavagetto: realm: remove rb_site [puppet] - 10https://gerrit.wikimedia.org/r/340997 [15:42:36] (03PS1) 10Giuseppe Lavagetto: discovery: add api endpoint [puppet] - 10https://gerrit.wikimedia.org/r/340998 [15:42:38] (03PS1) 10Giuseppe Lavagetto: realm: remove most references to mwprimary where dns discovery should be enough. [puppet] - 10https://gerrit.wikimedia.org/r/340999 [15:42:40] (03PS1) 10Giuseppe Lavagetto: discovery: remove app_routes, switch mwprimary [puppet] - 10https://gerrit.wikimedia.org/r/341000 [15:43:13] <_joe_> bblack: ^^ when the dns is ready :) [15:44:12] are all those *oids actually active:active today? 
[15:44:19] <_joe_> yes [15:44:20] (capable without issue, I mean) [15:44:21] ok [15:45:00] <_joe_> only things not active-active are the few eqiad-only things, mediawiki, swift, what else... dunno [15:45:07] <_joe_> ores maybe [15:45:25] <_joe_> godog: do you think ores can work active-active? [15:46:20] _joe_: I'd guess so but I'm not by far the expert [15:46:39] <_joe_> I think it can, but we have to solve the precaching-in-both DCs thing [15:46:53] <_joe_> I'm pretty sure CP can do it from a previous discussion [15:50:33] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, and 2 others: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3070865 (10debt) Moving this to done as it seems like everything is finished with this fix. Please comment/reopen if t... [15:52:52] _joe_: just FYI - the "stretch" stuff for the caches that is nearing completion, can also active/active any of these that are public-facing (e.g. RB, cxserver, citoid). We can initially configure them active/passive of course, but the capability is there and we'll want to turn on where it makes sense [15:53:19] at which point for that public traffic, it's very split. user IPs that map into ulsfo+codfw will hit the app in codfw, and users mapping to esams or eqiad will hit eqiad. [15:54:53] _joe_, just saw the ORES ping. Let me know if you want to chat about precaching and active-active. 
[15:59:20] (03PS1) 10Jcrespo: Add more config options, so they can be tuned without new templates [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/341004 [16:01:13] PROBLEM - MD RAID on ms-be1012 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 4, Failed: 0, Spare: 2 [16:01:14] ACKNOWLEDGEMENT - MD RAID on ms-be1012 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 4, Failed: 0, Spare: 2 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T159540 [16:01:18] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1012 - https://phabricator.wikimedia.org/T159540#3070903 (10ops-monitoring-bot) [16:04:14] 06Operations, 10ops-eqiad, 13Patch-For-Review: Degraded RAID on ms-be1012 - https://phabricator.wikimedia.org/T157237#3070934 (10fgiunchedi) [16:04:15] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1012 - https://phabricator.wikimedia.org/T159540#3070936 (10fgiunchedi) [16:07:33] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:08:07] (03CR) 10Marostegui: [C: 031] Add more config options, so they can be tuned without new templates [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/341004 (owner: 10Jcrespo) [16:08:12] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, and 2 others: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3070961 (10debt) 05Open>03Resolved a:03debt [16:11:13] 06Operations, 06Labs, 06Release-Engineering-Team, 07Nodepool: Investigate why nodepool keeps leaking instances and why it stops for no reason sometimes - https://phabricator.wikimedia.org/T159543#3070970 (10Paladox) [16:11:55] 06Operations, 06Labs, 06Release-Engineering-Team, 07Nodepool: Investigate why nodepool keeps leaking instances and why it stops for no reason sometimes - https://phabricator.wikimedia.org/T159543#3070982 (10Paladox) p:05Triage>03High [16:15:29] <_joe_> halfak: basically my take is that for now we might want to make cp make the requests to both ores clusters [16:15:34] godog: sorry for the duplicate autogenerated task, do you know why icinga alerted again? [16:16:34] <_joe_> or it could make a job where it first makes one side compute the value, then submits it for precaching on the other? not sure and tbh I'm both too busy and tired to come up with a good solution [16:16:37] volans: I'm assuming because the host appeared back, the ssd has been swapped and I've run puppet again [16:17:09] oh was this the one auto-removed from puppetdb? [16:17:11] _joe_, that could make sense. A lot of our baseline computation is CP, but I can't imagine getting around this some other way. 
[16:17:21] <_joe_> halfak: tbh, I would rather use some hashing so that depending on the request it consistently calls one or the other redis cluster [16:17:22] I understand the busy/tiredness of this [16:17:28] !log Stopped Jenkins from processing builds while instances are being recycled [16:17:32] Right yeah. That sounds good to me too. [16:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:38] <_joe_> halfak: there is no PII/sensitive data in the ores caches, right? [16:17:42] Right [16:18:03] <_joe_> in that case it's even ok to do cross-dc unencrypted traffic, in case [16:18:48] +1 [16:19:03] We'll have a weird issue with task deduplication if we can't split between DCs [16:19:24] We use celery/redis task_ids to manage deduping and that saves us a lot [16:19:24] (03PS1) 10Filippo Giunchedi: [WIP] prometheus: add snmp_exporter module and role [puppet] - 10https://gerrit.wikimedia.org/r/341005 (https://phabricator.wikimedia.org/T148541) [16:19:43] volans: yep that one [16:19:43] <_joe_> actually: 1) celery should always be local to the dc [16:19:49] right [16:19:50] godog: ok then that makes sense [16:20:12] <_joe_> 2) caching can be sharded/shared between the DCs, using some redis sharding software [16:20:19] We only care about deduping for the listing-to-rcstream use-case. [16:20:24] <_joe_> the settings for the two are separated, right? [16:20:34] caching can handle the multi-revid request patterns. [16:20:45] Settings are separate, yes. [16:20:50] And it's easy to further separate. [16:21:05] <_joe_> it's enough to have separate settings at the app level [16:21:15] Was thinking we might want to have codfw *write* all cache to eqiad, but read from codfw. 
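The hashing idea _joe_ raises here (pick one redis cluster or the other deterministically per request key, so the two DC caches act as one larger, deduplicated cache) can be sketched as below. The shard names are placeholders, not real hostnames, and a production setup would do this in a proxy in front of redis rather than in the application:

```python
import hashlib


def shard_for(key, shards=("oresrdb-eqiad", "oresrdb-codfw")):
    """Deterministically map a cache key to one of the redis shards.

    Hashing the key (rather than, say, always reading locally) means both
    DCs compute the same shard for the same revision, so a score cached by
    one DC is found by the other, doubling effective cache capacity.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]
```

With a fixed two-member shard set, plain modulo hashing is enough; if shards could be added or removed, consistent hashing would be the usual choice so that membership changes only remap a fraction of the keys.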
[16:21:32] <_joe_> no, my idea is we put a proxy in front of redis for caching [16:21:37] Gotcha [16:22:01] <_joe_> and we actually let this proxy do hashing on the keys in order to shard them [16:22:09] <_joe_> so we get 2x cache capacity [16:22:13] <_joe_> ha [16:22:43] <_joe_> and given that calculating a score is slow, a larger cache should help [16:22:50] <_joe_> even if it has some more latency [16:23:06] (03CR) 10Jcrespo: [C: 032] Remove old CA (ssl='on') and add a new option "socket" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/338988 (https://phabricator.wikimedia.org/T157702) (owner: 10Jcrespo) [16:23:17] <_joe_> it's still ~ 30 ms IIRC [16:23:51] _joe_, that sounds reasonable to me. [16:24:04] Our response time for a cached score is ~30 ms already. [16:24:12] (03CR) 10Jcrespo: [C: 032] Add more config options, so they can be tuned without new templates [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/341004 (owner: 10Jcrespo) [16:24:14] <_joe_> so yeah that could go up a bit [16:24:22] <_joe_> but you'd get a better cache hit ratio [16:24:27] <_joe_> hopefully [16:24:31] Our expected response time for a score calculation is 1s [16:24:44] So -- way longer. [16:24:54] <_joe_> yes, that was my assumption as well [16:25:13] PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:27:24] 06Operations, 06Labs, 06Release-Engineering-Team, 07Nodepool: Investigate why nodepool keeps leaking instances and why it stops for no reason sometimes - https://phabricator.wikimedia.org/T159543#3071028 (10chasemp) a:03Andrew we merged https://gerrit.wikimedia.org/r/#/c/340986/ causing nova services to... [16:27:49] _joe_, anything you'd like me to turn into a task for my team (aka me)? [16:28:15] <_joe_> halfak: not really, I should just write down this on the task where we are already talking with Amir [16:28:40] OK great. 
Thanks for looking at this with me/us :) [16:31:16] (03PS1) 10Marostegui: site.pp: Enable ROW binlog for db1070 [puppet] - 10https://gerrit.wikimedia.org/r/341007 (https://phabricator.wikimedia.org/T153743) [16:36:43] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:38:51] 06Operations, 13Patch-For-Review: Migrate carbon to jessie - https://phabricator.wikimedia.org/T123733#3071038 (10Cmjohnson) [16:38:53] 06Operations, 10ops-eqiad, 10hardware-requests: decom carbon - https://phabricator.wikimedia.org/T158020#3071036 (10Cmjohnson) 05Open>03Resolved Removed from rack [16:41:50] (03PS1) 10Andrew Bogott: Revert "Nova: Turn off Verbose logging" [puppet] - 10https://gerrit.wikimedia.org/r/341013 [16:44:13] (03PS2) 10Filippo Giunchedi: [WIP] prometheus: add snmp_exporter module and role [puppet] - 10https://gerrit.wikimedia.org/r/341005 (https://phabricator.wikimedia.org/T148541) [16:44:37] 06Operations, 10ops-eqiad, 10hardware-requests, 13Patch-For-Review: Decommission analytics1026 and analytics1015 - https://phabricator.wikimedia.org/T147313#3071042 (10Cmjohnson) 05Open>03Resolved Removed from rack [16:45:22] 06Operations, 10hardware-requests: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821#3071049 (10Cmjohnson) [16:45:24] 06Operations, 10ops-eqiad, 10hardware-requests, 13Patch-For-Review: Decommission labsdb1002 - https://phabricator.wikimedia.org/T146455#3071047 (10Cmjohnson) 05Open>03Resolved Removed from rack [16:46:54] 06Operations, 10ops-eqiad: Investigate strontium disk issues on 2016-08-05 - https://phabricator.wikimedia.org/T142187#3071052 (10Cmjohnson) [16:46:56] 06Operations, 10ops-eqiad: Decommission strontium - https://phabricator.wikimedia.org/T142722#3071050 (10Cmjohnson) 05Open>03Resolved Server has been decom'd and removed from rack [16:51:44] 06Operations, 10Ops-Access-Requests: Requesting access to wikimedia-operations-channel-op for Luke081515 
- https://phabricator.wikimedia.org/T159473#3071057 (10RobH) So at the time of discussion, I didn't really post any followup questions (I was a bit too busy task swapping when @Luke081515 and I chatted in i... [16:54:13] RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [16:58:01] (03PS1) 10Addshore: Add beta hewiktionary to restbase config [puppet] - 10https://gerrit.wikimedia.org/r/341014 (https://phabricator.wikimedia.org/T158628) [17:03:13] RECOVERY - MegaRAID on db1053 is OK: OK: optimal, 1 logical, 2 physical [17:04:51] 06Operations, 10Revision-Scoring-As-A-Service-Backlog, 13Patch-For-Review: Set up oresrdb redis node in codfw - https://phabricator.wikimedia.org/T139372#3071105 (10Joe) So I thought a bit about it and come up with the following alternative solution 1) the celery side of the redis for ores MUST NOT be repli... [17:07:45] (03CR) 10Dzahn: [C: 031] Add beta hewiktionary to restbase config [puppet] - 10https://gerrit.wikimedia.org/r/341014 (https://phabricator.wikimedia.org/T158628) (owner: 10Addshore) [17:07:53] PROBLEM - puppet last run on mc1033 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:12:13] 06Operations, 10Icinga: decom neon (shutdown neon (icinga) after it has been replaced ) - https://phabricator.wikimedia.org/T125023#3071131 (10Cmjohnson) [17:12:20] 06Operations, 10ops-eqiad: decom neon (data center) - https://phabricator.wikimedia.org/T150490#3071128 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson Wiped and removed from rack [17:12:54] (03CR) 10Dzahn: [C: 031] Revert "ldap: Add warning to ldaplist" [puppet] - 10https://gerrit.wikimedia.org/r/337842 (https://phabricator.wikimedia.org/T114063) (owner: 10Hashar) [17:14:26] (03Merged) 10jenkins-bot: Remove old CA (ssl='on') and add a new option "socket" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/338988 (https://phabricator.wikimedia.org/T157702) (owner: 10Jcrespo) [17:15:06] (03Merged) 10jenkins-bot: Add more config options, so they can be tuned without new templates [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/341004 (owner: 10Jcrespo) [17:15:44] (03CR) 10Dzahn: [C: 031] "python-imaging: This compatibility package is built for Python 2 only." [puppet] - 10https://gerrit.wikimedia.org/r/337248 (owner: 10Reedy) [17:17:11] (03CR) 10Dzahn: "bump" [puppet] - 10https://gerrit.wikimedia.org/r/336990 (https://phabricator.wikimedia.org/T156174) (owner: 10Zhuyifei1999) [17:17:30] (03CR) 10Andrew Bogott: [V: 032 C: 032] Revert "Nova: Turn off Verbose logging" [puppet] - 10https://gerrit.wikimedia.org/r/341013 (owner: 10Andrew Bogott) [17:22:11] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#3071211 (10jcrespo) 05Open>03Resolved All disks are fine, there are 2 with 1 media error, and one with 2; but probably they will be ok for a while (one year). [17:23:16] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#3071220 (10Marostegui) All good - thanks Chris! 
``` root@db1053:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Level :... [17:23:36] 06Operations, 06Labs, 06Release-Engineering-Team, 07Nodepool: Investigate why nodepool keeps leaking instances and why it stops for no reason sometimes - https://phabricator.wikimedia.org/T159543#3071222 (10Paladox) p:05High>03Unbreak! Guessing unbreak as ci is down? [17:23:41] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#3071225 (10Marostegui) Good timing @jcrespo! [17:25:48] (03CR) 10Marostegui: "This compiles fine https://puppet-compiler.wmflabs.org/5644/" [puppet] - 10https://gerrit.wikimedia.org/r/341007 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [17:27:38] 06Operations, 10Wikimedia-Apache-configuration: Create 2030.wikimedia.org redirect to Meta portal - https://phabricator.wikimedia.org/T158981#3071236 (10gpaumier) [17:30:25] 06Operations, 10Wikimedia-Apache-configuration: Create 2030.wikimedia.org redirect to Meta portal - https://phabricator.wikimedia.org/T158981#3071238 (10gpaumier) Adding the #operations tag that was missing from this. We have a bit of a time constraint in that we'd need confirmation as soon as possible that th... [17:35:00] !log CI is mostly recovered. It could not spawn instance anymore. The queue is being processed and will take a while to be completed. 
Check status on https://integration.wikimedia.org/zuul/ | T159543 [17:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:05] T159543: Investigate why nodepool keeps leaking instances and why it stops for no reason sometimes - https://phabricator.wikimedia.org/T159543 [17:36:53] RECOVERY - puppet last run on mc1033 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [17:37:28] (03PS3) 10Filippo Giunchedi: [WIP] prometheus: add snmp_exporter module and role [puppet] - 10https://gerrit.wikimedia.org/r/341005 (https://phabricator.wikimedia.org/T148541) [17:42:17] (03PS4) 10Filippo Giunchedi: [WIP] prometheus: add snmp_exporter module and role [puppet] - 10https://gerrit.wikimedia.org/r/341005 (https://phabricator.wikimedia.org/T148541) [17:42:23] 06Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 15User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2192798 (10RobH) It turns out we are out of thermal paste onsite, but I'll order more. Chris will be out for the majority of next week, b... [17:43:40] (03CR) 10Muehlenhoff: "This needs functional testing; python-imaging is not only a transition package, it also provides some compatibility wrapper within the pac" [puppet] - 10https://gerrit.wikimedia.org/r/337248 (owner: 10Reedy) [17:44:00] 06Operations, 10ops-eqiad, 10hardware-requests: Phase out scandium.eqiad.wmnet - https://phabricator.wikimedia.org/T150936#3071399 (10Cmjohnson) 05Open>03Resolved This server has been decom'd and removed from rack. 
[17:44:27] 06Operations, 10Analytics, 10Analytics-Cluster: Migrate titanium to jessie (archiva.wikimedia.org upgrade) - https://phabricator.wikimedia.org/T123725#3071405 (10Cmjohnson) [17:44:29] 06Operations, 10ops-eqiad: decom titanium - https://phabricator.wikimedia.org/T145666#3071402 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson disks wiped and removed from rack [17:45:11] (03PS5) 10Filippo Giunchedi: [WIP] prometheus: add snmp_exporter module and role [puppet] - 10https://gerrit.wikimedia.org/r/341005 (https://phabricator.wikimedia.org/T148541) [17:46:48] 06Operations, 10ops-eqiad: decom magnesium (data center) - https://phabricator.wikimedia.org/T137006#3071429 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson this has been wiped and removed from rack. [17:47:10] (03Abandoned) 10Gehel: elasticsearch - eqiad servers move to jessie and data on /srv [puppet] - 10https://gerrit.wikimedia.org/r/323158 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [17:47:18] 06Operations, 10Ops-Access-Requests: Write access to Icinga - https://phabricator.wikimedia.org/T159564#3071446 (10cwdent) [17:48:00] (03Abandoned) 10Gehel: elasticsearch - mount elasticsearch data partition with noatime [puppet] - 10https://gerrit.wikimedia.org/r/318117 (owner: 10Gehel) [17:49:09] (03PS6) 10Filippo Giunchedi: [WIP] prometheus: add snmp_exporter module and role [puppet] - 10https://gerrit.wikimedia.org/r/341005 (https://phabricator.wikimedia.org/T148541) [17:55:02] (03PS5) 10Jcrespo: Test refactoring of mariadb config template system [puppet] - 10https://gerrit.wikimedia.org/r/340987 [17:58:41] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T158969#3071469 (10Cmjohnson) db1070 is under warranty for 2 more months. Requested new part from Dell. Congratulations: Work Order SR944780612 was successfully submitted.
[18:01:13] RECOVERY - MD RAID on ms-be1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [18:02:23] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1056 - https://phabricator.wikimedia.org/T159410#3071472 (10Cmjohnson) Disk has been replaced. rebuilding cmjohnson@db1056:~$ sudo megacli -PDList -aALL |grep "Firmware state" Firmware state: Online, Spun Up Firmware state: Rebuild Firmware state: Online,... [18:06:20] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1060 - https://phabricator.wikimedia.org/T158193#3029364 (10Cmjohnson) Replaced disk in slot 4. will wait for it to rebuild and then replace slot 7 Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware st... [18:06:42] *pokes greg-g* any objection to me pushing out a small css fix today / now? https://gerrit.wikimedia.org/r/#/c/340794 [18:07:13] 06Operations, 10ops-eqiad, 13Patch-For-Review: Degraded RAID on ms-be1012 - https://phabricator.wikimedia.org/T157237#3071476 (10Cmjohnson) 05Open>03Resolved Raid is back to normal. 
Resolving this task RECOVERY - MD RAID on ms-be1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [18:09:23] (03CR) 10Jcrespo: "The subtemplates work nicely- https://puppet-compiler.wmflabs.org/5650/" [puppet] - 10https://gerrit.wikimedia.org/r/340987 (owner: 10Jcrespo) [18:14:27] 06Operations, 06Labs, 06Release-Engineering-Team, 07Nodepool: Investigate why nodepool keeps leaking instances and why it stops for no reason sometimes - https://phabricator.wikimedia.org/T159543#3071508 (10Paladox) p:05Unbreak!>03High [18:17:03] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [18:17:53] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4113759 keys, up 123 days 9 hours - replication_delay is 0 [18:20:47] addshore: /me looks [18:21:22] fixing a regression from the train [18:22:04] addshore: sure [18:22:08] * Nemo_bis checks the brakes and railway inclination [18:22:19] greg-g: thanks! :) *doing now* [18:43:06] !log addshore@tin Synchronized php-1.29.0-wmf.14/extensions/RevisionSlider/modules/ext.RevisionSlider.css: T159428 [[gerrit:340794|Quick fix for misplaced tooltips on RTL wikis]] (duration: 00m 42s) [18:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:13] T159428: RevisionSlider tooltip position broken on RTL languages - https://phabricator.wikimedia.org/T159428 [18:43:34] {{done}} [18:54:45] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T158969#3071598 (10Marostegui) Excellent! Thank you! 
[18:59:38] (03PS1) 10Catrope: Enable RCFilters beta feature in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341024 [19:03:34] 06Operations, 10MediaWiki-extensions-InterwikiSorting, 10Wikidata, 10Wikimedia-Extension-setup, and 3 others: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183#3071633 (10Addshore) So, as far as I can see this is ready to go. I would propose a deployment date for t... [19:05:25] (03CR) 10Sbisson: [C: 031] Enable RCFilters beta feature in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341024 (owner: 10Catrope) [19:11:44] !log running refreshLinks.php across small wikis [19:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:48] (03PS1) 10Urbanecm: Update sr.wikibooks logo plus add HD version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341027 (https://phabricator.wikimedia.org/T159534) [19:14:30] (03CR) 10Dzahn: [C: 031] "how about adding python3-pil for the captcha but not also switching away from python-imaging in this change? they can both be installed in" [puppet] - 10https://gerrit.wikimedia.org/r/337248 (owner: 10Reedy) [19:16:22] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1060 - https://phabricator.wikimedia.org/T158193#3071684 (10Marostegui) Thanks Chris! It will take a long time, so probably best to replace 7 on Monday :-) ``` root@db1060:~# megacli -PDRbld -ShowProg -PhysDrv [32:4] -aALL Rebuild Progress on Device at E... 
[19:17:53] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 610 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4119905 keys, up 123 days 10 hours - replication_delay is 610 [19:17:53] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 611 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4119666 keys, up 123 days 10 hours - replication_delay is 611 [19:21:53] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4117513 keys, up 123 days 10 hours - replication_delay is 0 [19:22:53] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4117367 keys, up 123 days 11 hours - replication_delay is 0 [19:33:27] !log restart elasticsearch on relforge1001 to update remote reindex whitelist [19:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:14] !log restart elasticsearch on relforge1002 to update remote reindex whitelist [19:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:11] (03PS1) 10Urbanecm: Update bs.wiktionary logo plus add HD version of it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341031 (https://phabricator.wikimedia.org/T159542) [19:36:34] (03PS2) 10Urbanecm: Update sr.wikibooks logo plus add HD version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341027 (https://phabricator.wikimedia.org/T159534) [19:43:55] (03PS1) 10Addshore: wmgUseInterwikiSorting true for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341032 (https://phabricator.wikimedia.org/T150183) [19:43:57] (03PS1) 10Addshore: wmgUseInterwikiSorting true for wikidata clients, excluding wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341033 (https://phabricator.wikimedia.org/T150183) [19:43:59] 
(03PS1) 10Addshore: wmgUseInterwikiSorting true for all wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341034 (https://phabricator.wikimedia.org/T150183) [19:44:16] RainbowSprinkles: ^^ could you give those a quick once over to make sure that I am using computed db lists correctly? [19:45:10] (03PS1) 10Urbanecm: Change bs.wiktionary sitename and metanamespace name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341035 (https://phabricator.wikimedia.org/T159538) [19:45:15] Looks right [19:45:18] (03PS1) 10Addshore: Use wmgUseInterwikiSorting for labs from prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341036 [19:45:21] RainbowSprinkles: awesome :) ty! [19:45:37] (03CR) 10Addshore: [C: 04-2] "To be scheduled" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341032 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore) [19:46:54] (03CR) 10jerkins-bot: [V: 04-1] wmgUseInterwikiSorting true for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341032 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore) [19:47:42] (03CR) 10jerkins-bot: [V: 04-1] wmgUseInterwikiSorting true for wikidata clients, excluding wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341033 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore) [19:47:57] (03PS1) 10Dzahn: icinga: set IP for benefactorevents/eventdonations to 127.0.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/341037 [19:50:12] (03PS2) 10Addshore: wmgUseInterwikiSorting true for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341032 (https://phabricator.wikimedia.org/T150183) [19:50:22] (03PS2) 10Addshore: wmgUseInterwikiSorting true for wikidata clients, excluding wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341033 (https://phabricator.wikimedia.org/T150183) [19:51:04] (03PS2) 10Addshore: wmgUseInterwikiSorting true for all wikidata clients [mediawiki-config] -
10https://gerrit.wikimedia.org/r/341034 (https://phabricator.wikimedia.org/T150183) [19:51:11] (03CR) 10jerkins-bot: [V: 04-1] icinga: set IP for benefactorevents/eventdonations to 127.0.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/341037 (owner: 10Dzahn) [19:51:13] (03PS2) 10Addshore: Use wmgUseInterwikiSorting for labs from prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341036 [19:52:50] (03CR) 10jerkins-bot: [V: 04-1] wmgUseInterwikiSorting true for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341032 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore) [19:53:20] (03CR) 10jerkins-bot: [V: 04-1] wmgUseInterwikiSorting true for wikidata clients, excluding wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341033 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore) [19:56:00] bah, symlink! [19:59:23] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [20:00:13] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4117914 keys, up 123 days 11 hours - replication_delay is 35 [20:05:55] (03PS3) 10Addshore: wmgUseInterwikiSorting true for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341032 (https://phabricator.wikimedia.org/T150183) [20:10:31] (03PS3) 10Addshore: wmgUseInterwikiSorting true for wikidata clients, excluding wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341033 (https://phabricator.wikimedia.org/T150183) [20:10:38] (03PS3) 10Addshore: wmgUseInterwikiSorting true for all wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341034 (https://phabricator.wikimedia.org/T150183) [20:10:46] (03PS3) 10Addshore: Use wmgUseInterwikiSorting for labs from prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341036 [20:18:42] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1056 - 
https://phabricator.wikimedia.org/T159410#3071891 (10Marostegui) 05Open>03Resolved a:03Marostegui This looks good now! Thanks ``` root@db1056:~# megacli -PDRbld -ShowProg -PhysDrv [32:1] -aALL Device(Encl-32 Slot-1) is not in rebuild process... [20:22:05] (03CR) 10Mattflaschen: [C: 032] Enable RCFilters beta feature in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341024 (owner: 10Catrope) [20:23:13] RECOVERY - MegaRAID on db1056 is OK: OK: optimal, 1 logical, 2 physical [20:23:42] (03Merged) 10jenkins-bot: Enable RCFilters beta feature in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341024 (owner: 10Catrope) [20:23:56] (03CR) 10jenkins-bot: Enable RCFilters beta feature in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341024 (owner: 10Catrope) [20:24:13] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 614 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4120189 keys, up 123 days 12 hours - replication_delay is 614 [20:24:14] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 615 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4120726 keys, up 123 days 11 hours - replication_delay is 615 [20:25:40] 06Operations, 10Ops-Access-Requests: Write access to Icinga - https://phabricator.wikimedia.org/T159564#3071906 (10RobH) a:03RobH So by default you can ack anything that you are alerted for. The main reason this doesn't work is if there is a mismatch between what your username is in ldap, and what we have y... 
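The `Redis replication status tcp_6479` flaps on rdb2005/rdb2006 above follow a simple rule: the check goes CRITICAL once `replication_delay` crosses 600 seconds (the "614 600" in the alert text is the measured delay followed by the threshold) and recovers when the replica catches up. A minimal sketch of that threshold logic, assuming the delay has already been read from the replica; the function name is illustrative and the real plugin reports more detail:

```python
def replication_status(replication_delay, crit=600):
    """Map a replica's replication delay (seconds) to a check status.

    Mirrors the alert text above: CRITICAL once the delay exceeds `crit`
    (message shows both the measured delay and the threshold), OK otherwise.
    """
    if replication_delay > crit:
        return ("CRITICAL", f"replication_delay is {replication_delay} {crit}")
    return ("OK", f"replication_delay is {replication_delay}")

# rdb2006 at 20:24:13 versus its later recovery:
problem = replication_status(614)
recovery = replication_status(0)
```

Short spikes like these typically clear on their own once the job-queue burst that delayed replication drains, which is why each PROBLEM here is followed by a RECOVERY a few minutes later.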
[20:26:33] !log mattflaschen@tin Synchronized wmf-config/InitialiseSettings-labs.php: Beta Cluster only (duration: 00m 40s) [20:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:48] (03PS2) 10Dzahn: icinga: set IP for benefactorevents/eventdonations to 127.0.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/341037 [20:32:59] 06Operations, 10Ops-Access-Requests: Requesting access to wikimedia-operations-channel-op for Luke081515 - https://phabricator.wikimedia.org/T159473#3071917 (10Luke081515) >>! In T159473#3071057, @RobH wrote: > (...) > * Do you currently have ops in any other Wikimedia IRC Channels, or other non-WMF IRC channe... [20:35:28] (03PS3) 10Dzahn: icinga: set IP for benefactorevents/eventdonations to 127.0.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/341037 [20:38:56] (03CR) 10Dzahn: [C: 032] icinga: set IP for benefactorevents/eventdonations to 127.0.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/341037 (owner: 10Dzahn) [20:41:21] CUSTOM - Host eventdonations.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [20:41:56] 06Operations, 10Ops-Access-Requests: Requesting access to wikimedia-operations-channel-op for Luke081515 - https://phabricator.wikimedia.org/T159473#3068704 (10hashar) I am sponsoring @Luke081515 He has helped on a wide range of things on beta cluster, has the social connections with other IRC channel operator... [20:42:07] CUSTOM is me, and the change above is for getting rid of cruft in Icinga web ui [20:42:42] what matters is that the HTTPS checks are fine. this is just about ICMP [20:46:17] 06Operations, 10Ops-Access-Requests: Write access to Icinga - https://phabricator.wikimedia.org/T159564#3071971 (10RobH) @cwdent: It doesn't look like you are actually setup to be emailed or SMS/paged for any particular services, which is why you cannot acknowledge any of them. I'm happy to help get you setup... 
[20:46:54] mobrovac: how about a restbase config change in config.labs https://gerrit.wikimedia.org/r/#/c/341014/ [20:49:19] (03PS8) 10Dzahn: contint: slave role for Saucelabs jobs [puppet] - 10https://gerrit.wikimedia.org/r/338770 (owner: 10Hashar) [20:49:51] (03CR) 10Dzahn: [C: 032] "new role is in labs and already cherry-picked on master there" [puppet] - 10https://gerrit.wikimedia.org/r/338770 (owner: 10Hashar) [20:51:33] (03PS2) 10Urbanecm: Bs.wiktionary namespace changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341035 (https://phabricator.wikimedia.org/T159538) [20:52:16] mutante: if you feel adventurous I have a patch to migrate Jenkins to systemd :D [20:53:21] hashar: i think paladox has labs test instances :) [20:53:35] and let's compile it [20:53:51] which one specifically, i think there is.. hold on [20:53:55] I even created an instance specially to iterate/test it [20:54:07] it is a mess to review though :( [20:54:23] https://gerrit.wikimedia.org/r/#/c/337404/ [20:54:27] I went with several iterations [20:54:36] oh, that was zuul [20:54:40] messed up with systemd entirely but eventually Moritz pointed to the error [20:55:37] ah yeah the zuul-merger is now behind systemd [20:55:40] yep [20:55:43] zuul-server is not yet :/ [20:55:43] PROBLEM - puppet last run on ms-be1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:55:49] the thing that happened to us before was often that it works fine with a unit file to start/stop but then once you reboot it did not come up by itself without a human [20:56:02] hence we did quite a few reboots [20:56:13] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4122283 keys, up 123 days 12 hours - replication_delay is 0 [20:56:25] yup https://phabricator.wikimedia.org/T157785 [20:56:31] gotta find a way to reproduce the issue [20:56:37] but that one was for git-daemon [20:56:46] (yet another service I moved to systemd :D ) [20:56:55] heh, yes [20:56:55] or not [20:56:58] hmm I can't remember [20:57:13] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4122038 keys, up 123 days 12 hours - replication_delay is 0 [20:57:20] yea, make all the things systemd :p [20:57:21] (03CR) 10Paladox: "> we can test the rewrite rules with apache-fast-test. what it needs" [puppet] - 10https://gerrit.wikimedia.org/r/340900 (https://phabricator.wikimedia.org/T156120) (owner: 10Paladox) [20:57:28] (03PS3) 10Paladox: Gerrit: Add some apache rewrite rules for polygerrit [puppet] - 10https://gerrit.wikimedia.org/r/340900 (https://phabricator.wikimedia.org/T156120) [21:03:24] paladox: that patch I would wait for chad [21:03:26] hashar: so this one works after reboot? [21:03:32] Ok [21:03:46] hashar: we can test it on contin2001 while disabling puppet on contint1001, right [21:03:58] paladox: but if we go to polygerrit, most probably we will want to namespace it under /polygerrit/ or something similar. And have it support not running from / (your patch to upstream) [21:04:09] mutante: for jenkins? yeah should do [21:04:11] it all looks good to me. afaict. 
with the "spec" file i just trust you, heh [21:04:24] It will work on /r/ [21:04:25] since it uses a cookie [21:04:26] to switch to polygerrit [21:04:35] the issue with the daemon not working was for git-daemon which really is a sysvinit and somehow that confuses Puppet [21:05:04] mutante: I think Moritz managed to find the various gotchas I have missed :} [21:05:07] thats supported from 2.14 + it has a config to have both polygerrit and gwt enabled at the same time [21:05:14] (03CR) 10Dzahn: [C: 031] "let's disable puppet on contint1001, merge and see it is ok on contint2001 first" [puppet] - 10https://gerrit.wikimedia.org/r/337404 (owner: 10Hashar) [21:05:17] I rebased the patch today so it does not depend on anything [21:05:27] also the patch ^^ makes it more usable than it would be on the /r/ prefix. [21:05:42] !log disabled puppet on contint1001 [21:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:04] (03PS20) 10Dzahn: jenkins: migrate to systemd [puppet] - 10https://gerrit.wikimedia.org/r/337404 (owner: 10Hashar) [21:06:15] paladox: good to know there is a config switch :) [21:06:24] Yep :) [21:06:33] then polygerrit has a nodejs daemon running ? [21:06:39] Nope [21:06:47] that is only needed when running the tests [21:06:50] hashar: yes, +1 from Moritz sure helps :) [21:07:14] yeah he gave me some good advice and docs related to systemd [21:07:26] and https://github.com/gerrit-review/gerrit/blob/master/polygerrit-ui/run-server.sh [21:07:34] I originally wanted to keep the /etc/default/jenkins to keep the change minimal but that does not work [21:07:35] but it now runs in gerrit [21:08:09] paladox: and it is in go :} [21:08:25] schedules downtime for puppet on contint1001 [21:08:30] Well it is but gerrit doesn't use that file. Polygerrit runs internally in gerrit [21:09:20] and with that config it allows you to enable polygerrit (for which you use a cookie, ?polygerrit=1) [21:09:35] paladox: what's a grunt-banana?
[21:09:42] mutante: I will run puppet with puppet-run so we get logs in /var/log/puppet.log [21:10:05] hashar: just waiting for Verified. ok, cool! [21:10:14] * hashar blames jenkins [21:10:24] Isn't that what we use in mediawiki to test i18n/*.json files [21:10:29] it doesn't want to be converted [21:11:09] paladox: i dunno, but you made a change to a grunt-banana-checker [21:11:26] Oh ah yep thats for the i18n/*.json files [21:11:38] aha, was just a curious name [21:12:44] PROBLEM - puppet last run on druid1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:12:46] yep [21:15:33] (03CR) 10Dzahn: [C: 032] jenkins: migrate to systemd [puppet] - 10https://gerrit.wikimedia.org/r/337404 (owner: 10Hashar) [21:16:08] hashar: submitting on master ... [21:16:17] * hashar rolls the drums [21:16:52] so you are running it with the log .. i'm standing back [21:16:57] running it [21:17:11] I took a capture of the output of systemctl status jenkins in /root/ [21:17:15] for comparison [21:17:19] nice [21:17:31] the good thing with that patch is that I got more familiar with systemd [21:17:39] and how the logs have to be sent via rsyslog [21:17:55] applied on contint2001 [21:17:57] :) [21:18:29] hmm [21:18:41] I am disappointed [21:18:45] also we are expecting to not see the Icinga check for systemd units reporting [21:18:57] diff jenkins-status jenkins-systemd-status [21:19:03] is basically a noop [21:19:08] (03PS1) 10RobH: this updates fundraising team members for icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/341051 [21:19:39] ah let me restart it [21:19:50] yea, can you stop/start/restart with that unit file [21:19:58] also we can reboot it if you want [21:20:27] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3072056 (10tstarling) >>!
In T156924#3070042, @Joe wrote: > - Given the way PHP works, we can't... [21:20:44] Mar 03 21:19:45 contint2001 jenkins[102489]: Unrecognized option: --accessLoggerClassName=winstone.accesslog.SimpleAccessLogger [21:20:45] :( [21:21:15] (03PS2) 10Jgreen: this updates fundraising team members for icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/341051 (owner: 10RobH) [21:21:18] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Write access to Icinga - https://phabricator.wikimedia.org/T159564#3072058 (10cwdent) @robh Contact info is correct on officewiki, and I'd like to receive SMS. Time zone is MST. My carrier is Google fi so it switches, usually between T-Mobile and Sprin... [21:21:53] PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:22:03] PROBLEM - jenkins_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [21:22:48] (03CR) 10Jgreen: [C: 031] this updates fundraising team members for icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/341051 (owner: 10RobH) [21:22:57] (03CR) 10RobH: [C: 032] this updates fundraising team members for icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/341051 (owner: 10RobH) [21:23:20] ACKNOWLEDGEMENT - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn conversion to systemd https://gerrit.wikimedia.org/r/#/c/337404/ [21:23:21] ACKNOWLEDGEMENT - jenkins_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war daniel_zahn conversion to systemd https://gerrit.wikimedia.org/r/#/c/337404/ [21:23:21] ACKNOWLEDGEMENT - jenkins_zmq_publisher on contint2001 is CRITICAL: connect to address 127.0.0.1 and port 8888: Connection refused daniel_zahn conversion to systemd https://gerrit.wikimedia.org/r/#/c/337404/ [21:24:05] mutante: yeah sorry I screwed up that one [21:24:06] hashar: wanna first just disable the "hack to have access log" part? [21:24:09] there are two sets of options [21:24:11] some for java [21:24:12] and see if the rest is cool [21:24:14] and others for jenkins itself [21:24:27] something like: java $JAVA_ARGS -jar jenkins.war $JENKINS_ARGS [21:24:28] ah [21:24:35] and I have some jenkins args passed to java [21:24:41] so java complains it does not recognize them bah [21:24:43] RECOVERY - puppet last run on ms-be1003 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [21:24:51] I swear I tested on the instance :/ [21:24:56] i was looking at that line..
gotcha [21:25:01] no worries [21:26:13] PROBLEM - carbon-cache@b service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@b is failed [21:26:13] PROBLEM - carbon-cache@a service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@a is failed [21:26:13] PROBLEM - carbon-cache@c service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is failed [21:26:14] PROBLEM - carbon-cache@d service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@d is failed [21:26:14] PROBLEM - carbon-local-relay service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-local-relay is failed [21:26:23] PROBLEM - carbon-frontend-relay service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-frontend-relay is inactive [21:26:53] PROBLEM - Check systemd state on graphite2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:27:03] PROBLEM - carbon-cache@g service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@g is failed [21:27:03] PROBLEM - carbon-cache@f service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@f is failed [21:27:03] PROBLEM - carbon-cache@h service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@h is failed [21:27:04] PROBLEM - carbon-cache@e service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@e is failed [21:27:20] mutante: I am working on a patch + spec :( [21:27:44] hashar: ok, it's cool, it's just 2001 :) [21:28:03] hrmm.. trying to run puppet on einsteinium, its just stuck at loading facts [21:28:10] im not sure if my session just got borked or what... [21:28:25] robh: it's just slow.. one moment it will start doing stuff all at once [21:28:33] it wasnt this slow yesterday ;_; [21:28:54] hmm.. 
dunno, i also made a change and wanted to run puppet but it was disabled and i saw your message why :) [21:29:12] yeah jeff and i were actively changing things between private and public repo [21:29:17] and if they were out of sync it would break icinga [21:29:19] *nod* [21:29:23] yea [21:29:40] removing a contact was the thing that would break [21:29:53] we also added a bunch of contacts but those dont break things as easily [21:30:06] mutante: so i shouldnt try to ctrl+c out and rerun eh? [21:30:16] heh, i put that, and the next step just started... still slow [21:30:47] mutante: you changed icinga config? [21:31:02] Jeff_Green: it hates our config changes [21:31:18] robh: yes, i expect a change for benefactorevents/eventdonations hosts [21:31:24] robh: how so? [21:31:28] i expect that the IP changes [21:32:09] hope there isn't an issue with that part? what's failing [21:32:36] hrmm, whats the command line arg to show the errors? [21:32:43] checkconfig just says fail with no detail =P [21:32:51] icinga -v /etc/icinga/icinga.cfg [21:32:51] and i keep forgetting this command. [21:32:57] -v [21:33:09] thx, ok i had that but i wasnt sure it wasnt going to actually restart [21:33:36] Warning: Duplicate definition found for service 'keystone http' on host 'labtestcontrol2001' (config file '/etc/icinga/puppet_services.cfg', starting on line 205283) [21:33:37] nope, it's fine, but should show error [21:33:45] Error: Contact group 'fr-tech' specified in service 'check_redis' for host 'alnilam' (file '/etc/icinga/nsca_frack.cfg', line 804) is not defined anywhere! [21:33:49] there we go [21:33:58] ah [21:34:06] Error: Contact group 'fr-tech' specified in service 'check_redis' for host 'frqueue1001' (file '/etc/icinga/nsca_frack.cfg', line 804) is not defined anywhere! [21:34:09] basically a ton of those [21:34:27] it might just be fixed after another run ?
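`icinga -v /etc/icinga/icinga.cfg` prints `Warning:` and `Error:` lines like the ones quoted just above; only the errors make the config check fail. A minimal sketch of pulling the undefined contact groups out of that output (the regex and helper name are illustrative, not part of the actual tooling):

```python
import re

# Matches lines such as:
# Error: Contact group 'fr-tech' specified in service 'check_redis' for
# host 'alnilam' (...) is not defined anywhere!
ERROR_RE = re.compile(r"^Error: Contact group '([^']+)' .* is not defined anywhere!$")

def undefined_contact_groups(verify_output):
    """Collect contact groups that `icinga -v` reports as undefined."""
    groups = set()
    for line in verify_output.splitlines():
        m = ERROR_RE.match(line.strip())
        if m:
            groups.add(m.group(1))
    return groups

# Two of the errors shown above, plus a warning that must be ignored:
sample = """\
Warning: Duplicate definition found for service 'keystone http' on host 'labtestcontrol2001' (config file '/etc/icinga/puppet_services.cfg', starting on line 205283)
Error: Contact group 'fr-tech' specified in service 'check_redis' for host 'alnilam' (file '/etc/icinga/nsca_frack.cfg', line 804) is not defined anywhere!
Error: Contact group 'fr-tech' specified in service 'check_redis' for host 'frqueue1001' (file '/etc/icinga/nsca_frack.cfg', line 804) is not defined anywhere!
"""
missing = undefined_contact_groups(sample)
```

Here the "ton of those" errors all boil down to one missing group, `fr-tech`, which is exactly what robh's follow-up contactgroup rename fixes.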
[21:34:32] robh: so that's the 'fundraising' contact group rename [21:34:43] if it just got re-enabled and those were freshly added, i mean [21:35:01] oh fundraising [21:35:06] that didn't get changed in my patchset =P [21:35:12] so it should be fundraising to 'fr-tech' [21:35:13] heh [21:35:17] i only removed the contact fr-tech [21:35:29] fixing. [21:35:45] i should have noticed when I reviewed, but I was totally distracted by phabricator weirdness [21:35:57] (03PS1) 10Hashar: jenkins: pass access args to jenkins, not java [puppet] - 10https://gerrit.wikimedia.org/r/341071 [21:36:09] mutante: https://gerrit.wikimedia.org/r/341071 should do it [21:36:22] mutante: eg move the args to be passed to jenkins.jar instead of java. [21:37:03] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors [21:37:34] (03CR) 10Dzahn: [C: 032] jenkins: pass access args to jenkins, not java [puppet] - 10https://gerrit.wikimedia.org/r/341071 (owner: 10Hashar) [21:37:45] I swear sometimes icinga-wm just needs to take a chill pill [21:37:48] (03PS1) 10RobH: rename contactgroup fundraising to fr-tech [puppet] - 10https://gerrit.wikimedia.org/r/341075 [21:37:52] I am so p*** of by my change screw up :/ [21:38:06] Zppix: it's correct about the correctness not being correct [21:38:24] (03CR) 10RobH: [C: 032] rename contactgroup fundraising to fr-tech [puppet] - 10https://gerrit.wikimedia.org/r/341075 (owner: 10RobH) [21:38:27] hashar: ive done worse remember when i screwed up grrrit-wm [21:38:36] hurry up zuul i wanna unbreak icinga =P [21:38:40] mutante: ik [21:38:54] robh: thats me with every patch :D [21:39:03] (03PS2) 10RobH: rename contactgroup fundraising to fr-tech [puppet] - 10https://gerrit.wikimedia.org/r/341075 [21:39:24] hashar: you can run puppet again on 2001... now [21:39:30] roger [21:39:36] ok, no one else in zuul ahead of me in operations puppet repo [21:39:44] unless someone manually +v ahead of me im ok...
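The jenkins failure hashar chases in parallel here is that options meant for jenkins.war (like `--accessLoggerClassName=winstone.accesslog.SimpleAccessLogger`) ended up before `-jar`, so the JVM tried to interpret them and died with `Unrecognized option`. Following his own `java $JAVA_ARGS -jar jenkins.war $JENKINS_ARGS` shape, the two argument sets must stay on the correct sides of `-jar`; a sketch of that rule (the helper below is illustrative, not the actual puppet/systemd template):

```python
def jenkins_cmdline(java_args, jenkins_args, war="/usr/share/jenkins/jenkins.war"):
    """Build `java $JAVA_ARGS -jar jenkins.war $JENKINS_ARGS`.

    Anything after `-jar <war>` is handed to Jenkins itself; anything before
    it is consumed by the JVM, which is why a jenkins option placed there
    fails with "Unrecognized option".
    """
    return ["/usr/bin/java", *java_args, "-jar", war, *jenkins_args]

cmd = jenkins_cmdline(
    ["-Djava.awt.headless=true"],  # illustrative JVM flag
    ["--accessLoggerClassName=winstone.accesslog.SimpleAccessLogger"],
)
```

Gerrit change 341071 moves the access-log options to the jenkins side, and 341085 then makes the unit actually pass them there.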
[21:39:47] * robh twitches
[21:39:52] did not :)
[21:40:34] robh: ok ill submit a change to typos in puppet xD
[21:40:47] no way im done in the clear
[21:40:47] /lib/systemd/system/jenkins.service looks better
[21:40:54] you can resume smashing zuul to bits ;]
[21:41:05] * Zppix grabs hammer
[21:41:44] RECOVERY - puppet last run on druid1003 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[21:41:56] ok all merged and running on icinga host again
[21:42:26] :) hides under hoodie from evil day star, it's nice outside but soo bright for the screen
[21:42:56] at least icinga doesnt just start itself and shit the bed
[21:43:01] our nagios install used to do that.
[21:43:06] robh: cool, did you happen to see that benefactorevents change?
[21:43:09] looks
[21:43:15] Nagios, can burn
[21:43:18] and its all good
[21:43:20] robh: yes, i was about to say that. not anymore :)
[21:43:25] mutante: so much scroll i didnt specifically
[21:43:34] but it realoaded so you should be able to find it
[21:43:42] Jeff_Green: ^ success
[21:43:47] (PS1) Hashar: jenkins: actually pass args to jenkins.war [puppet] - https://gerrit.wikimedia.org/r/341085
[21:43:50] robh: woot!
[21:43:54] mutante: I am very terrible :/
[21:43:59] ok, so now im going to update the private repo with casey's sms info
[21:44:14] Jeff_Green: tell him welcome to hell and then laugh maniacally
[21:44:26] mutante: I did all the fix but forgot the most important one which was to update the systemd service so the args get passed to jenkins.war :\
[21:44:40] * Zppix slaps hashar
[21:44:51] robh: looks like it's not there yet, but also nothing broken.... will look again
[21:44:58] robh: maybe I'll just drop some random service and see how he responds to unannounced pagerfire :-)
[21:45:06] Operations, fundraising-tech-ops: adding fundraising to icinga to ack alerts - https://phabricator.wikimedia.org/T159576#3072114 (RobH) Open>Resolved
[21:45:15] Zppix: you know that is rather disturbing when working ? This channel is to maintain the infrastructure so it is meant to be serious topic only!
[21:45:21] robh: and Jeff_Green play nice you two
[21:45:34] Zppix: I don't mind emotes and some fun on other channels though. But this one has some level of seriousness
[21:45:53] Zppix: what.
[21:46:10] Jeff_Green: im replying to you saying dropping a service
[21:46:14] i think he was saying dont cause the new guy to run away from an influx of pages ;]
[21:46:29] ^
[21:46:30] (CR) Dzahn: [C: 2] jenkins: actually pass args to jenkins.war [puppet] - https://gerrit.wikimedia.org/r/341085 (owner: Hashar)
[21:46:33] Zppix: i got it :-P
[21:47:02] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct
[21:47:40] hashar: you are good to go for puppet run
[21:47:50] done and verifying now
[21:47:52] RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational
[21:48:02] RECOVERY - jenkins_service_running on contint2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war
[21:48:03] i like both of those recoveries there
[21:48:06] :)
[21:48:33] there is no wdiff on the servers :(
[21:48:57] hashar: puppet may of not finished yet?
[21:49:43] mutante: I have double checked the params and that looks good
[21:49:49] the order has slightly changed but it is not a big deal
[21:50:02] do we need package wdiff on contint role?
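The two Jenkins patches above (r341071, r341085) both come down to argument placement in the unit file: the JVM only consumes options that appear before `-jar`, and everything after the `.war` path is handed to Jenkins (winstone) itself. A sketch of the corrected `ExecStart`; the specific flag values here are illustrative placeholders, not the exact WMF unit:

```ini
# /lib/systemd/system/jenkins.service (sketch)
[Service]
# Before -jar: JVM arguments. After the .war path: Jenkins/winstone
# arguments. Putting Jenkins options on the java side (the original
# bug) means jenkins.war never sees them.
ExecStart=/usr/bin/java -Xmx2g \
    -jar /usr/share/jenkins/jenkins.war \
    --httpPort=8080 \
    --accessLoggerClassName=winstone.accesslog.SimpleAccessLogger
```

This also matches the Icinga check that recovers later in the log, which looks for a process matching `^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war`.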
[21:50:22] na
[21:50:33] ok
[21:50:37] I just copied the log on my local machine and ran the diff here :}
[21:50:46] hmm here == at home on my laptop
[21:50:47] alright
[21:51:08] so it seems all good to also do it on contint1001 ?
[21:51:11] yeah
[21:51:17] will have to time the restart of jenkins
[21:51:21] Operations, Ops-Access-Requests, Patch-For-Review: Write access to Icinga - https://phabricator.wikimedia.org/T159564#3072128 (RobH) @cwdent: It turns out we don't have anyone in MST getting pages. We can set you up to receive pages 24x7, or only during your waking hours. Since you are the only one...
[21:51:45] !log enabling puppet on contint1001 and puppet-run
[21:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:49] ah, you want to wait with that then?
[21:52:18] doing it
[21:52:23] puppet does not refresh the service
[21:52:44] Info: Sleeping for 59 seconds (splay is enabled) bouh
[21:52:50] heh
[21:53:21] also noticed you merged a new CI role. thx!
[21:53:39] the idea is eventually to split jenkins in several jenkins instances that are easier to manage
[21:53:47] yep, per "already cherry-picked" and labs master
[21:54:08] that sounds good , yea
[21:54:17] !log restarting Jenkins
[21:54:21] only two jobs were running
[21:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:54:30] great
[21:54:56] it is back up
[21:55:11] jobs got cancelled abruptly and rescheduled
[21:56:00] rsyslog works
[21:56:30] nice:) did we have a ticket for "all the systemd things" btw
[21:56:38] or this one specifically
[21:58:41] and I validated another point
[21:58:55] when jenkins is restarted via SIGTERM the jobs get all aborted
[21:59:01] and nodepool does delete the instances
[21:59:18] mutante: success! thanks a ton for that. I am going to update the wiki doc now :}
[22:00:09] hashar: good work :)
[22:00:16] hashar: :) very nice
[22:00:37] I should have tested one more time on a fresh instance before: /
[22:01:06] That sounds good something break?
[22:01:30] hashar: how do you feel about an actual reboot of 2001? double check if they come back. doesnt have to be right now though
[22:02:19] well,i am a bit limited by battery life .. oops
[22:02:35] i gotta move somewhere with power
[22:02:40] in a bit
[22:03:24] mutante: do move. I will reboot contint2001 meanwhile
[22:03:39] !log rebooting contint2001
[22:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:05:03] :) cool, ok
[22:05:06] moving
[22:05:32] hashar, just to let you know, polygerrit is like gwt now. It dosent need to be started by nodejs or npm. But it requires those when building the gerrit.war though.
[22:06:31] paladox: good that is one less thing to deal with so :}
[22:07:01] Yep, but i have found one huge bug. Prefixed urls are broken without the apache rewrites rules i came up with
[22:07:32] even with the rewrites, writing a comment or rebasing through polygerrit will be broken. But at least polygerrit will be usable, like viewing a diff
[22:07:40] and having a mobile site.
[22:08:05] They say that it's diff is faster then gwt.
[22:08:31] or thats what they saying here https://groups.google.com/forum/#!topic/repo-discuss/9rhEDC6GxoY
[22:17:05] mutante: well it works as intended
[22:17:26] there is a gotcha which is that on contint2001 the service is not enabled, so it does not spawn on boot
[22:17:39] but the icinga monitoring probe are not disabled so that would alarm out
[22:18:21] gotta do something like we did for zuul server: disable monitoring when the service is not enabled
[22:21:15] hashar: or startup.sh to boot the service automatically?
[22:21:37] hashar is that a bug?
[22:21:41] try systemctl enable
[22:21:52] we don't want the service to start automatically
[22:21:57] that machine is an hot spare for now
[22:22:08] so jenkins is intentionally not spawning
[22:22:29] the issue is that we would still alarm out on it
[22:22:34] oh
[22:23:06] hashar: ok, that all makes sense. good, yes
[22:23:30] mutante: huge thanks really and thank you for your patience :}
[22:23:43] like we dud for zuul server, full ack
[22:23:46] next week I will try an upgrade of jenkins 1 to jenkins 2 on contint2001
[22:23:55] hashar: no problem, thanks too!
[22:24:23] ok
[22:25:51] Operations, Ops-Access-Requests: Requesting access to wikimedia-operations-channel-op for Luke081515 - https://phabricator.wikimedia.org/T159473#3072222 (RobH) I cannot think of anything else that would be needed to grant this other than no objections during the ops team meeting on Monday. I'll make sur...
[22:27:50] sshd-phad service is also about to be converted to base::service_unit
[22:28:47] (PS12) Paladox: Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928)
[22:28:58] :)
[22:29:03] * paladox tests it on phabricator
[22:29:22] (PS13) Paladox: Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928)
[22:29:53] (PS10) Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928)
[22:29:56] Why does rebasing on operations/puppet take so long?
[22:30:26] paladox: what the tests?
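The hot-spare gotcha hashar describes above (contint2001's jenkins is intentionally disabled, yet the Icinga probe would still alarm) boils down to making the check aware of the unit's enabled state, as was done for the zuul server. A minimal sketch of that decision logic; this is a hypothetical wrapper, not the actual WMF check:

```python
# Nagios-style check logic that stays quiet on a hot spare:
# only an *enabled* unit that is not running should page.
# Return codes follow the Nagios plugin API: 0=OK, 2=CRITICAL.

def check_service(enabled: bool, active: bool):
    if not enabled:
        # Intentionally not spawning on boot (hot spare): do not page.
        return 0, "OK: unit disabled on this host, check skipped"
    if active:
        return 0, "OK: service running"
    return 2, "CRITICAL: enabled unit is not running"

# In a real wrapper the two booleans would come from
# `systemctl is-enabled jenkins` and `systemctl is-active jenkins`.
print(check_service(enabled=False, active=False))  # hot spare: OK
print(check_service(enabled=True, active=False))   # real outage: CRITICAL
```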
[22:30:36] the patches above
[22:30:41] it's not tests
[22:30:44] its rebasing
[22:30:52] most likly due to how many objects it has
[22:31:11] Repo size the amount of patches it bases over
[22:31:37] Yep
[22:31:54] it depends on the time of day
[22:32:09] or it feels like that
[22:32:21] when many other things are already going on or not
[22:36:59] (PS11) Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928)
[22:38:12] PROBLEM - puppet last run on mw1175 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:39:48] Operations, Ops-Access-Requests, Patch-For-Review: Write access to Icinga - https://phabricator.wikimedia.org/T159564#3072261 (RobH) So Casey is currently setup to receive emails 24x7 and can ack all the fundraising based hosts. Once we ensure that 8-midnight local is acceptable (or if 24x7 is prefe...
[22:40:01] (CR) Dzahn: "http://puppet-compiler.wmflabs.org/5651/" [puppet] - https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[22:40:15] Operations, Ops-Access-Requests: Write access to Icinga - https://phabricator.wikimedia.org/T159564#3072262 (RobH)
[22:40:46] (CR) Dzahn: [C: 1] Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[22:46:09] (CR) Paladox: "Tested and works." [puppet] - https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[22:49:42] PROBLEM - puppet last run on mc1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:53:06] (PS4) Paladox: Phabricator: Move sshd-phab.conf.erb and sshd-phab.service.erb into initscripts [puppet] - https://gerrit.wikimedia.org/r/339786
[22:56:20] (PS5) Paladox: Phabricator: Move sshd-phab.conf.erb and sshd-phab.service.erb into initscripts [puppet] - https://gerrit.wikimedia.org/r/339786
[22:57:41] (PS6) Paladox: Phabricator: Move sshd-phab.conf.erb and sshd-phab.service.erb into initscripts [puppet] - https://gerrit.wikimedia.org/r/339786
[22:57:53] (PS14) Paladox: Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928)
[22:58:11] (PS12) Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928)
[22:59:12] (PS15) Paladox: Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928)
[22:59:29] (PS16) Paladox: Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928)
[23:00:10] Operations, Labs, Release-Engineering-Team, Nodepool: Investigate why nodepool keeps leaking instances and why it stops for no reason sometimes - https://phabricator.wikimedia.org/T159543#3072330 (hashar) Open>Resolved Nova / OpenStack recovered. Thus instances managed to get deleted and...
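The `base::service_unit` define that paladox is iterating patch sets for above is WMF puppet's wrapper for shipping a systemd/init file together with the managed service. A rough sketch of a call for the `phd` daemon; parameter names are recalled from the 2017-era define and should be treated as assumptions (the authoritative reference is `modules/base/manifests/service_unit.pp` in operations/puppet):

```puppet
# Sketch: manage phd via base::service_unit instead of a hand-rolled
# init script. `systemd_template('phd')` is expected to resolve to a
# unit-file ERB template in the calling module; treat the exact
# parameter names and template lookup as unverified assumptions.
base::service_unit { 'phd':
    ensure  => present,
    systemd => systemd_template('phd'),
}
```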
[23:00:11] (PS17) Paladox: Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928)
[23:01:37] RECOVERY - Host eventdonations.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.07 ms
[23:06:17] Operations, Ops-Access-Requests: Write access to Icinga - https://phabricator.wikimedia.org/T159564#3072353 (cwdent) @RobH - sorry for the delay, was afk for awhile. Being paged any time is fine, I will keep it on silent mode if need be.
[23:06:17] RECOVERY - puppet last run on mw1175 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[23:07:35] PROBLEM - Host eventdonations.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[23:08:06] ^ don't be alarmed by that, it's nothing, i was just trying to fix cruft about that
[23:08:18] doesnt work as i hoped
[23:08:40] the host is just fine (on https)
[23:11:00] (PS13) Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928)
[23:11:57] (PS18) Paladox: Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928)
[23:12:04] (PS14) Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928)
[23:12:10] !log phabricator: restarting apache real quick
[23:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:15] PROBLEM - puppet last run on lvs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:16:45] RECOVERY - Host eventdonations.wikimedia.org is UP: check_tcp -p 80
[23:17:45] RECOVERY - puppet last run on mc1012 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[23:19:29] !log icinga: for special external hosts benefactorevents and eventdonations, "submit passive check result for this host" -> "check_tcp -p 80" to avoid "crit hosts" that just don't respond to ICMP (http://www.htmlgraphic.com/nagios-check-host-without-ping/)
[23:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:19:35] PROBLEM - Host eventdonations.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[23:19:39] meh
[23:22:06] RECOVERY - Host eventdonations.wikimedia.org is UP: check_tcp -p 80
[23:23:21] !log phabricator: restarted apache 1 last time, removed hack
[23:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:45] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:44:15] RECOVERY - puppet last run on lvs1002 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[23:54:32] Operations, Ops-Access-Requests: Write access to Icinga - https://phabricator.wikimedia.org/T159564#3072444 (RobH) Open>Resolved Cool, you should now be all setup. I setup the sms to use the google fi email to sms gateway, so it should work. You are also setup to receive emails for everything y...
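The passive-check workaround !logged above (per the linked htmlgraphic.com article) can also be made permanent by changing the host's check command from ICMP to a TCP probe, so a host that filters ping is still considered UP as long as port 80 answers. A sketch in Nagios/Icinga object syntax; `check_tcp` is the stock monitoring plugin, while the host template and command definition are placeholders:

```
# Host that drops ICMP: declare it UP based on TCP port 80 instead of ping.
define host {
    use             generic-host                  ; placeholder template
    host_name       eventdonations.wikimedia.org
    address         eventdonations.wikimedia.org
    check_command   check_tcp!80                  ; instead of check-host-alive
}

# A matching command definition is needed somewhere in the config:
define command {
    command_name    check_tcp
    command_line    $USER1$/check_tcp -H $HOSTADDRESS$ -p $ARG1$
}
```

That matches the later recoveries in this log, which report "UP: check_tcp -p 80" rather than "PING OK".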