[00:00:12] https://gerrit.wikimedia.org/r/#/c/293639/ [00:00:24] ah, yes, the self-merged one is https://gerrit.wikimedia.org/r/#/c/293646/ [00:01:22] so we need to deploy 293646 and 293650? [00:04:37] * aude looks [00:05:07] Dereckson: only https://gerrit.wikimedia.org/r/#/c/293650/ [00:05:21] article placeholder extension is part of the wikidata build [00:05:26] k [00:06:00] and we don't cut a new branch every week, thus the extension is still on wmf.3 and currently deployed core uses that [00:07:07] 06Operations, 10DNS, 10Phabricator, 10Traffic, 13Patch-For-Review: Redirect phabricator.mediawiki.org to phabricator.wikimedia.org - https://phabricator.wikimedia.org/T137252#2370304 (10Dzahn) I don't like it very much, especially not in wikipedia.org. Where would we draw the line here before we add it t... [00:07:24] !log ran mwscript maintenance/updateCollation.php --wiki=tawiktionary --force [00:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:12:14] !log ran mwscript maintenance/updateCollation.php --wiki=tawikisource --force [00:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:12:25] kaldari: thanks for taking care of that [00:12:37] NP, thanks for the SWAT [00:12:52] yw [00:13:20] aude: live on mw1017 [00:14:28] ok [00:14:37] (03PS2) 10Dzahn: mw_rc_irc: flake8 [puppet] - 10https://gerrit.wikimedia.org/r/283666 (owner: 10Ladsgroup) [00:14:49] (03CR) 10Dzahn: [C: 032] "confirmed. ircd_stats.py:3:1: F401 'sys' imported but unused" [puppet] - 10https://gerrit.wikimedia.org/r/283666 (owner: 10Ladsgroup) [00:15:03] RECOVERY - puppet last run on cp2026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:15:06] looks ok [00:17:29] Dereckson: could you ping me when SWAT is done? Got a small labs-only change that I'd like to merge when you're done. 
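The updateCollation runs logged above are standard MediaWiki maintenance-script invocations via the mwscript wrapper on the deployment host. A minimal sketch of the pattern, using the Tamil-language wikis touched throughout this log (kaldari ran them one wiki at a time; the loop below is only illustrative):

```
# Re-run category collation for each wiki; --force re-collates every row
# rather than only rows with no collation yet.
for wiki in tawiktionary tawikisource tawikiquote tawikinews tawikibooks tawiki; do
    mwscript maintenance/updateCollation.php --wiki="$wiki" --force
done
```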
[00:17:32] thcipriani: sure [00:17:37] thanks :) [00:18:03] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:18:31] (03CR) 10Dzahn: "eh, hah, this rebased into thin air and then i merged ..like nothing" [puppet] - 10https://gerrit.wikimedia.org/r/283666 (owner: 10Ladsgroup) [00:19:05] !log dereckson@tin Synchronized php-1.28.0-wmf.5/extensions/Wikidata/composer.lock: Update Wikidata - Fix uncaught exception in ArticlePlaceholder (1/3, no-op) (duration: 00m 27s) [00:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:19:42] (03PS3) 10Dzahn: Remove unused import in labs [puppet] - 10https://gerrit.wikimedia.org/r/279896 (owner: 10Ladsgroup) [00:20:41] scap sync is a little slow tonight [00:20:43] (03CR) 10Dzahn: "this too, i was about to merge then rebased it ..and it wasnt a diff anymore" [puppet] - 10https://gerrit.wikimedia.org/r/279896 (owner: 10Ladsgroup) [00:20:46] :/ [00:20:46] !log dereckson@tin Synchronized php-1.28.0-wmf.5/extensions/Wikidata/vendor/composer/installed.json: Update Wikidata - Fix uncaught exception in ArticlePlaceholder (2/3, no-op) (duration: 00m 25s) [00:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:21:35] !log dereckson@tin Synchronized php-1.28.0-wmf.5/extensions/Wikidata/extensions/ArticlePlaceholder/includes/SearchHookHandler.php: Update Wikidata - Fix uncaught exception in ArticlePlaceholder (3/3) (duration: 00m 25s) [00:21:36] aude: here you are :) [00:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:21:39] checking again [00:21:44] (03Abandoned) 10Dzahn: Remove unused import in labs [puppet] - 10https://gerrit.wikimedia.org/r/279896 (owner: 10Ladsgroup) [00:21:53] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 2.343 second response time [00:21:55] looks good [00:21:57] thanks :) [00:22:23] You're welcome [00:22:24] thcipriani: ping [00:22:32] (03PS2) 10Dzahn: Move misc memcached files into their module [puppet] - 10https://gerrit.wikimedia.org/r/283576 (owner: 10Chad) [00:22:36] Dereckson: thanks :) [00:23:22] (03PS2) 10Thcipriani: ores.wikimedia.org instead of ores.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293419 (owner: 10Ladsgroup) [00:24:00] (03CR) 10Thcipriani: [C: 032] ores.wikimedia.org instead of ores.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293419 (owner: 10Ladsgroup) [00:24:05] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 07LDAP: add Ladsgroup to nda LDAP group (was: Grant graphite.wikimedia.org rights to grafana-admin LDAP group) - https://phabricator.wikimedia.org/T137373#2370332 (10Ladsgroup) I confirm that I was able to login to graphite. Awesome! Thanks. 
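The three "Synchronized php-1.28.0-wmf.5/extensions/Wikidata/..." SAL entries above come from syncing individual files out from the deployment host during SWAT. A hedged sketch of that step, assuming the usual /srv/mediawiki-staging working copy on tin (the exact entry point at the time may have been `sync-file` rather than `scap sync-file`):

```
cd /srv/mediawiki-staging
# Each sync pushes one file to the app servers and writes the SAL message seen above.
scap sync-file php-1.28.0-wmf.5/extensions/Wikidata/composer.lock \
    'Update Wikidata - Fix uncaught exception in ArticlePlaceholder (1/3, no-op)'
scap sync-file php-1.28.0-wmf.5/extensions/Wikidata/vendor/composer/installed.json \
    'Update Wikidata - Fix uncaught exception in ArticlePlaceholder (2/3, no-op)'
scap sync-file php-1.28.0-wmf.5/extensions/Wikidata/extensions/ArticlePlaceholder/includes/SearchHookHandler.php \
    'Update Wikidata - Fix uncaught exception in ArticlePlaceholder (3/3)'
```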
[00:24:26] (03Abandoned) 10Dzahn: Move misc memcached files into their module [puppet] - 10https://gerrit.wikimedia.org/r/283576 (owner: 10Chad) [00:24:36] (03Merged) 10jenkins-bot: ores.wikimedia.org instead of ores.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293419 (owner: 10Ladsgroup) [00:25:48] (03CR) 10Dzahn: ""Cool URLs don't change… but secure.wm.org was never cool" haha, i guess yea" [puppet] - 10https://gerrit.wikimedia.org/r/257510 (https://phabricator.wikimedia.org/T120790) (owner: 10Reedy) [00:27:32] (03CR) 10Dzahn: [C: 031] Add LANG to /etc/defaults/puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/272613 (owner: 10BryanDavis) [00:27:55] !log thcipriani@tin Synchronized wmf-config/CommonSettings-labs.php: [[gerrit:293419|ores.wikimedia.org instead of ores.wmflabs.org]] (duration: 00m 25s) [00:31:26] !log ran mwscript maintenance/updateCollation.php --wiki=tawikiquote --force [00:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:34:27] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [00:36:02] !log git pull on strontium because i merged a non-change [00:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:36:25] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [00:36:26] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [00:37:22] hrmm. same on palladium [00:38:17] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [00:39:41] !log ran mwscript maintenance/updateCollation.php --wiki=tawikinews --force [00:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:40:20] !log ran mwscript maintenance/updateCollation.php --wiki=tawikibooks --force [00:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:41:29] Dereckson: Do I need to schedule deployments for config changes that only affect labs? Like: https://gerrit.wikimedia.org/r/#/c/287936/ [00:41:49] or can I just merge them? 
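On the "Unmerged changes on repository puppet" alerts above: the check compares the working copy on each puppetmaster against origin/production in the directory named in the alert text. A sketch of the manual fix mutante logged as "git pull on strontium" (and repeated on palladium); the inspection step is an assumption, only the pull itself is taken from the log:

```
cd /var/lib/git/operations/puppet          # directory from the alert text
git log --oneline HEAD..origin/production  # list the change(s) the check is flagging
git pull                                   # bring the working copy back in line with origin/production
```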
[00:42:55] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:42:57] kaldari: i think you can and the criteria is https://wikitech.wikimedia.org/wiki/Deployments/Inclusion_criteria [00:44:02] afaik [00:46:55] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 8.767 second response time [00:50:00] (03CR) 10Ori.livneh: Make mwrepl a little more user friendly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/283604 (owner: 10EBernhardson) [00:50:04] (03PS3) 10Dzahn: admin: move manybubbles to absented users [puppet] - 10https://gerrit.wikimedia.org/r/293656 (https://phabricator.wikimedia.org/T130113) [00:51:25] (03CR) 10Dzahn: [C: 032] admin: move manybubbles to absented users [puppet] - 10https://gerrit.wikimedia.org/r/293656 (https://phabricator.wikimedia.org/T130113) (owner: 10Dzahn) [00:56:45] !log ran mwscript maintenance/updateCollation.php --wiki=tawiki --force [00:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:23:00] 06Operations, 06Performance-Team, 06Services, 07Availability, 13Patch-For-Review: Create restbase BagOStuff subclass (session storage) - https://phabricator.wikimedia.org/T137272#2370453 (10GWicke) A basic session storage service prototype using hyperswitch is available at https://github.com/gwicke/autho... [01:26:06] I'll deploy some authmanager fixes [01:28:53] (03PS2) 10Dzahn: add lint:ignore's for remaining files outside modules [puppet] - 10https://gerrit.wikimedia.org/r/293569 [01:29:46] (03PS3) 10Dzahn: add lint:ignore's for remaining files outside modules [puppet] - 10https://gerrit.wikimedia.org/r/293569 (https://phabricator.wikimedia.org/T93645) [01:31:35] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 1 failures [01:42:01] !log tgr@tin Synchronized php-1.28.0-wmf.5/includes/specials/SpecialCreateAccount.php: deploying [[gerrit:293667]]: fix AuthManager dashboard (duration: 00m 25s) [01:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:42:58] !log tgr@tin Synchronized php-1.28.0-wmf.5/includes/specials/SpecialUserLogin.php: deploying [[gerrit:293667]]: fix AuthManager dashboard (duration: 00m 33s) [01:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:44:04] !log tgr@tin Synchronized php-1.28.0-wmf.5/includes/specialpage/LoginSignupSpecialPage.php: deploying [[gerrit:293668]]: fix AuthManager warning spam (duration: 00m 25s) [01:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:46:57] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:50:46] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.312 second response time [01:57:35] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:02:13] (03PS4) 10Dzahn: add lint:ignore's for remaining files outside modules [puppet] - 10https://gerrit.wikimedia.org/r/293569 (https://phabricator.wikimedia.org/T93645) [02:02:54] always great timing grrrit-wm [02:06:27] (03PS1) 10Dzahn: decom cp1043,cp1044 from site.pp, installserver [puppet] - 10https://gerrit.wikimedia.org/r/293669 (https://phabricator.wikimedia.org/T133614) [02:07:28] (03PS2) 10Dzahn: rm cp1043,cp1044 from site, installserver & torrus [puppet] - 10https://gerrit.wikimedia.org/r/293669 
(https://phabricator.wikimedia.org/T133614) [02:11:14] PROBLEM - puppet last run on lithium is CRITICAL: CRITICAL: Puppet has 1 failures [02:29:13] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.5) (duration: 11m 32s) [02:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:35:16] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Jun 10 02:35:15 UTC 2016 (duration 6m 2s) [02:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:38:18] RECOVERY - puppet last run on lithium is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [02:45:18] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:57:17] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.022 second response time [03:10:59] (03PS1) 10Aaron Schulz: Lower $wgAPIMaxLagThreshold to 5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293674 (https://phabricator.wikimedia.org/T95501) [03:14:51] (03CR) 10Aaron Schulz: [C: 032] Lower $wgAPIMaxLagThreshold to 5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293674 (https://phabricator.wikimedia.org/T95501) (owner: 10Aaron Schulz) [03:15:26] (03Merged) 10jenkins-bot: Lower $wgAPIMaxLagThreshold to 5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293674 (https://phabricator.wikimedia.org/T95501) (owner: 10Aaron Schulz) [03:17:07] !log aaron@tin Synchronized wmf-config/CommonSettings.php: Lower $wgAPIMaxLagThreshold to 5 (duration: 00m 36s) [03:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:42:16] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:48:07] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.617 second response time [04:11:07] akosiaris: it seems apertium-apy repo doesn't contain upstream source, I'll fix that. [04:24:20] 06Operations, 06Project-Admins, 10Traffic: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2370538 (10Peachey88) 05Open>03declined a:05Danny_B>03None >>! In T134960#2338185, @Aklapper wrote: > Proposing to decline as per last two comments. 
Done [04:24:45] (03PS3) 10Muehlenhoff: Define backup for contint [puppet] - 10https://gerrit.wikimedia.org/r/293294 (https://phabricator.wikimedia.org/T80385) [04:31:36] (03PS1) 10Papaul: DHCP: Add MAC address for mw2218 and mw2239 to mw2250 Bug:T135466 [puppet] - 10https://gerrit.wikimedia.org/r/293678 (https://phabricator.wikimedia.org/T135466) [04:32:20] (03CR) 10Muehlenhoff: [C: 032 V: 032] Define backup for contint [puppet] - 10https://gerrit.wikimedia.org/r/293294 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff) [04:39:37] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:41:36] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.027 second response time [04:45:16] ACKNOWLEDGEMENT - jenkins_zmq_publisher on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 8888: Connection refused Muehlenhoff Setup ongoing [04:48:18] !log installing squid3 security updates on Ubuntu systems [04:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:54:17] (03PS3) 10KartikMistry: apertium-dan: New upstream release [debs/contenttranslation/apertium-dan] - 10https://gerrit.wikimedia.org/r/269912 (https://phabricator.wikimedia.org/T124137) [04:58:32] (03PS2) 10Muehlenhoff: Enable backup for contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/293295 (https://phabricator.wikimedia.org/T80385) [05:01:16] PROBLEM - Disk space on lithium is CRITICAL: DISK CRITICAL - free space: /srv/syslog 11469 MB (3% inode=99%) [05:02:18] (03PS2) 10Dzahn: DHCP: Add MAC address for mw2218 and mw2239 to mw2250 Bug:T135466 [puppet] - 10https://gerrit.wikimedia.org/r/293678 (https://phabricator.wikimedia.org/T135466) (owner: 10Papaul) [05:02:50] (03CR) 10Dzahn: "this will not hurt but most likely contint1001 will be decomed again anyways" [puppet] - 10https://gerrit.wikimedia.org/r/293295 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff) [05:03:54] (03CR) 10Dzahn: [C: 032] "no mw2243 but i assume you had a reason" [puppet] - 10https://gerrit.wikimedia.org/r/293678 (https://phabricator.wikimedia.org/T135466) (owner: 10Papaul) [05:03:58] (03PS3) 10KartikMistry: apertium-nob: New upstream release [debs/contenttranslation/apertium-nob] - 10https://gerrit.wikimedia.org/r/269914 (https://phabricator.wikimedia.org/T124317) [05:05:47] 06Operations, 10ops-codfw: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2370583 (10Dzahn) DHCP: Add MAC address for mw2218 and mw2239 to mw2250 https://phabricator.wikimedia.org/T135466 merged, mw2243 was not in it but i assumed papaul had a reason to skip that [05:06:58] (03CR) 10Dzahn: "the current plan seems to be to move all of CI into labs..." [puppet] - 10https://gerrit.wikimedia.org/r/293295 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff) [05:07:16] (03PS4) 10KartikMistry: apertium-nno: New upstream release [debs/contenttranslation/apertium-nno] - 10https://gerrit.wikimedia.org/r/269915 (https://phabricator.wikimedia.org/T124137) [05:12:30] 06Operations, 10Wikimedia-IRC-RC-Server, 13Patch-For-Review: udpmxircecho spam/not working if unable to connect to irc server - https://phabricator.wikimedia.org/T134875#2370584 (10Dzahn) @fgiunchedi btw since we once talked about this on IRC. meanwhile i added you to the labs project for this. https://w... 
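The "installing squid3 security updates on Ubuntu systems" entry above does not say how the rollout was performed; fleet-wide updates are normally driven by tooling rather than run by hand. Purely as a single-host sketch of what the update amounts to:

```
# Hedged sketch only: upgrade just the squid3 package, then confirm the version.
sudo apt-get update
sudo apt-get install --only-upgrade squid3
apt-cache policy squid3
```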
[05:12:30] (03PS3) 10KartikMistry: apertium-dan-nor: New upstream release [debs/contenttranslation/apertium-dan-nor] - 10https://gerrit.wikimedia.org/r/269916 (https://phabricator.wikimedia.org/T124137) [05:13:49] 06Operations, 10Wikimedia-IRC-RC-Server, 13Patch-For-Review: udpmxircecho spam/not working if unable to connect to irc server - https://phabricator.wikimedia.org/T134875#2370585 (10Dzahn) a:05Dzahn>03None i call this partially resolved. and now giving back to pool for the moment. going to be on vacation... [05:14:52] 06Operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#2370587 (10Dzahn) a:05Dzahn>03Muehlenhoff [05:15:21] 06Operations, 10Icinga, 10Monitoring: re-create script for manual paging - https://phabricator.wikimedia.org/T82937#2370588 (10Dzahn) a:05Dzahn>03None [05:19:08] (03CR) 10Muehlenhoff: [C: 04-1] "Nice typo in the subject :-) One smaller comment, but needs to be discussed in Ops meeting." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293510 (https://phabricator.wikimedia.org/T137412) (owner: 10JanZerebecki) [05:21:39] (03CR) 10KartikMistry: "I have pushed tag, but package needs update." [debs/contenttranslation/giella-core] - 10https://gerrit.wikimedia.org/r/270671 (https://phabricator.wikimedia.org/T120087) (owner: 10KartikMistry) [05:28:20] (03PS6) 10Gehel: Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 [05:50:08] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2370605 (10KartikMistry) [05:53:37] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:54:48] !log re-enabling puppet on carbon [05:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:55:28] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 2.270 second response time [06:17:06] PROBLEM - puppet last run on mw1115 is CRITICAL: CRITICAL: Puppet has 1 failures [06:19:56] Hello! 
Started re-imaging of mw112[57].eqiad appservers, they will throw alarms later on :) [06:21:07] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [06:24:07] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:26:06] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 4.747 second response time [06:28:53] !log mw2215-mw2238 -signing puppet certs, salk-key initial run [06:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:30:47] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:47] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:17] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:06] PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:07] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:08] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.004 second response time [06:33:57] PROBLEM - Disk space on logstash1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=97%) [06:33:57] RECOVERY - Disk space on ms-be2012 is OK: DISK OK [06:34:17] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 6.867 second response time [06:35:26] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:42:13] !log bounced hhvm on mw1264 (backtrace in /tmp/hhvm.2197.bt) [06:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:43:42] moritzm: mw1264 probably it was my fault, got reimaged yesterday and didn't get deployed via scap.. was it stuck? [06:47:17] PROBLEM - puppet last run on logstash1001 is CRITICAL: CRITICAL: puppet fail [06:47:17] RECOVERY - puppet last run on mw1115 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:49:39] elukey: should be fine, I'll ack the Apache alert until it's deployed [06:50:17] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time [06:51:27] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:52:17] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.027 second response time [06:54:39] moritzm: ah snap didn't see the icinga alert, it must be the one after "the host is not in the DSH group".. 
so probably doing a scap sync-common (not sure about the syntax) would resolve the problem [06:56:16] RECOVERY - puppet last run on druid1002 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:57:08] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:37] ACKNOWLEDGEMENT - Apache HTTP on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50421 bytes in 0.006 second response time Muehlenhoff in setup by elukey [06:57:46] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:00:33] (03CR) 10Muehlenhoff: [C: 031] "Looks good, I'm planning to upgrade the remaining firejail installations on 0.9.26 to 0.9.38 and will merge that during the update." [puppet] - 10https://gerrit.wikimedia.org/r/293515 (owner: 10JanZerebecki) [07:01:17] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:02:16] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: puppet fail [07:04:12] doing scap-pull on mw1264 [07:04:47] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [07:05:42] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [07:06:14] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:07:52] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 8.986 second response time [07:08:04] PROBLEM - Apache HTTP on mw1265 is CRITICAL: Connection timed out [07:08:23] PROBLEM - Apache HTTP on mw2222 is CRITICAL: Connection timed out [07:08:23] PROBLEM - Apache HTTP on mw2220 is CRITICAL: Connection timed out [07:08:23] PROBLEM - Apache HTTP on mw2221 is CRITICAL: Connection timed out [07:09:02] PROBLEM - nutcracker process on mw1265 is CRITICAL: Timeout while attempting connection [07:09:02] PROBLEM - Check size of conntrack table on mw1267 is CRITICAL: Timeout while attempting connection [07:09:02] PROBLEM - nutcracker port on mw2222 is CRITICAL: Timeout while attempting connection [07:09:03] PROBLEM - nutcracker port on mw2221 is CRITICAL: Timeout while attempting connection [07:09:03] PROBLEM - nutcracker port on mw2220 is CRITICAL: Timeout while attempting connection [07:09:22] PROBLEM - puppet last run on mw1265 is CRITICAL: Timeout while attempting connection [07:09:22] PROBLEM - DPKG on mw1267 is CRITICAL: Timeout while attempting connection [07:09:23] PROBLEM - nutcracker process on mw2220 is CRITICAL: Timeout while attempting connection [07:09:23] PROBLEM - nutcracker process on mw2221 is CRITICAL: Timeout while attempting connection [07:09:23] PROBLEM - nutcracker process on mw2222 is CRITICAL: Timeout while attempting connection [07:09:32] mw126[57] are my fault (bootstrapping) [07:09:33] RECOVERY - Apache HTTP on mw1264 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.065 second response time [07:09:40] not sure about the codfw ones [07:09:42] PROBLEM - Apache HTTP on mw2217 is CRITICAL: Connection timed out [07:09:43] PROBLEM - Disk space on mw1267 is CRITICAL: Timeout while attempting connection [07:09:43] PROBLEM - salt-minion processes on mw1265 is CRITICAL: Timeout while attempting connection [07:09:43] PROBLEM - configured eth on mw2217 is CRITICAL: Timeout while attempting connection [07:09:43] 
PROBLEM - puppet last run on mw2222 is CRITICAL: Timeout while attempting connection [07:09:43] PROBLEM - puppet last run on mw2220 is CRITICAL: Timeout while attempting connection [07:09:43] PROBLEM - puppet last run on mw2221 is CRITICAL: Timeout while attempting connection [07:10:02] elukey: new install [07:10:03] PROBLEM - salt-minion processes on mw2221 is CRITICAL: Timeout while attempting connection [07:10:04] PROBLEM - salt-minion processes on mw2220 is CRITICAL: Timeout while attempting connection [07:10:04] PROBLEM - salt-minion processes on mw2222 is CRITICAL: Timeout while attempting connection [07:10:04] PROBLEM - MD RAID on mw1267 is CRITICAL: Timeout while attempting connection [07:10:04] PROBLEM - dhclient process on mw2217 is CRITICAL: Timeout while attempting connection [07:10:09] papaul: o/ [07:10:12] PROBLEM - mediawiki-installation DSH group on mw2217 is CRITICAL: Host mw2217 is not in mediawiki-installation dsh group [07:10:22] PROBLEM - Check size of conntrack table on mw1265 is CRITICAL: Timeout while attempting connection [07:10:32] PROBLEM - DPKG on mw1265 is CRITICAL: Timeout while attempting connection [07:10:32] PROBLEM - Check size of conntrack table on mw2221 is CRITICAL: Timeout while attempting connection [07:10:32] PROBLEM - nutcracker port on mw2217 is CRITICAL: Timeout while attempting connection [07:10:32] PROBLEM - Check size of conntrack table on mw2222 is CRITICAL: Timeout while attempting connection [07:10:32] PROBLEM - Check size of conntrack table on mw2220 is CRITICAL: Timeout while attempting connection [07:10:43] PROBLEM - Apache HTTP on mw1267 is CRITICAL: Connection timed out [07:10:52] PROBLEM - configured eth on mw1267 is CRITICAL: Timeout while attempting connection [07:10:52] PROBLEM - Disk space on mw1265 is CRITICAL: Timeout while attempting connection [07:10:52] PROBLEM - DPKG on mw2221 is CRITICAL: Timeout while attempting connection [07:10:53] PROBLEM - DPKG on mw2220 is CRITICAL: Timeout while attempting connection [07:10:53] PROBLEM - DPKG on mw2222 is CRITICAL: Timeout while attempting connection [07:10:53] PROBLEM - nutcracker process on mw2217 is CRITICAL: Timeout while attempting connection [07:11:04] PROBLEM - MD RAID on mw1265 is CRITICAL: Timeout while attempting connection [07:11:12] PROBLEM - dhclient process on mw1267 is CRITICAL: Timeout while attempting connection [07:11:12] PROBLEM - puppet last run on mw2217 is CRITICAL: Timeout while attempting connection [07:11:12] PROBLEM - Disk space on mw2220 is CRITICAL: Timeout while attempting connection [07:11:12] PROBLEM - Disk space on mw2221 is CRITICAL: Timeout while attempting connection [07:11:12] PROBLEM - Disk space on mw2222 is CRITICAL: Timeout while attempting connection [07:11:13] PROBLEM - mediawiki-installation DSH group on mw1267 is CRITICAL: Host mw1267 is not in mediawiki-installation dsh group [07:11:23] PROBLEM - salt-minion processes on mw2217 is CRITICAL: Timeout while attempting connection [07:11:23] PROBLEM - MD RAID on mw2220 is CRITICAL: Timeout while attempting connection [07:11:23] PROBLEM - MD RAID on mw2221 is CRITICAL: Timeout while attempting connection [07:11:23] PROBLEM - MD RAID on mw2222 is CRITICAL: Timeout while attempting connection [07:11:42] PROBLEM - nutcracker port on mw1267 is CRITICAL: Timeout while attempting connection [07:11:53] PROBLEM - nutcracker process on mw1267 is CRITICAL: Timeout while attempting connection [07:11:54] PROBLEM - configured eth on mw1265 is CRITICAL: Timeout while attempting connection [07:12:02] PROBLEM - 
Check size of conntrack table on mw2217 is CRITICAL: Timeout while attempting connection [07:12:12] PROBLEM - configured eth on mw2221 is CRITICAL: Connection refused by host [07:12:13] PROBLEM - DPKG on mw2217 is CRITICAL: Connection refused by host [07:12:13] RECOVERY - Apache HTTP on mw2221 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.076 second response time [07:12:14] PROBLEM - puppet last run on mw1267 is CRITICAL: Timeout while attempting connection [07:12:14] PROBLEM - dhclient process on mw1265 is CRITICAL: Timeout while attempting connection [07:12:14] PROBLEM - configured eth on mw2220 is CRITICAL: Timeout while attempting connection [07:12:14] PROBLEM - configured eth on mw2222 is CRITICAL: Timeout while attempting connection [07:12:23] PROBLEM - mediawiki-installation DSH group on mw1265 is CRITICAL: Host mw1265 is not in mediawiki-installation dsh group [07:12:23] PROBLEM - dhclient process on mw2221 is CRITICAL: Connection refused by host [07:12:23] PROBLEM - Disk space on mw2217 is CRITICAL: Connection refused by host [07:12:32] PROBLEM - salt-minion processes on mw1267 is CRITICAL: Timeout while attempting connection [07:12:32] PROBLEM - dhclient process on mw2222 is CRITICAL: Timeout while attempting connection [07:12:33] PROBLEM - dhclient process on mw2220 is CRITICAL: Timeout while attempting connection [07:12:42] PROBLEM - mediawiki-installation DSH group on mw2220 is CRITICAL: Host mw2220 is not in mediawiki-installation dsh group [07:12:42] PROBLEM - mediawiki-installation DSH group on mw2221 is CRITICAL: Host mw2221 is not in mediawiki-installation dsh group [07:12:42] PROBLEM - mediawiki-installation DSH group on mw2222 is CRITICAL: Host mw2222 is not in mediawiki-installation dsh group [07:12:42] PROBLEM - MD RAID on mw2217 is CRITICAL: Connection refused by host [07:12:43] PROBLEM - nutcracker port on mw1265 is CRITICAL: Timeout while attempting connection [07:12:44] =============== These are all new installs ========================== [07:13:01] ================================================================== [07:13:33] RECOVERY - Apache HTTP on mw2217 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.079 second response time [07:13:43] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.007 second response time [07:14:52] RECOVERY - DPKG on mw2221 is OK: All packages OK [07:14:53] RECOVERY - nutcracker port on mw2221 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [07:15:03] RECOVERY - Disk space on mw2221 is OK: DISK OK [07:15:22] RECOVERY - nutcracker process on mw2221 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [07:15:23] RECOVERY - salt-minion processes on mw2217 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:15:23] RECOVERY - MD RAID on mw2221 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [07:15:43] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.100 second response time [07:15:43] RECOVERY - configured eth on mw2217 is OK: OK - interfaces up [07:15:53] RECOVERY - Check size of conntrack table on mw2217 is OK: OK: nf_conntrack is 0 % full [07:16:02] RECOVERY - salt-minion processes on mw2221 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:16:03] RECOVERY - dhclient process on mw2217 is OK: PROCS OK: 0 processes with command name dhclient [07:16:12] RECOVERY - configured eth on mw2221 is OK: OK - 
interfaces up [07:16:24] RECOVERY - dhclient process on mw2221 is OK: PROCS OK: 0 processes with command name dhclient [07:16:24] RECOVERY - nutcracker port on mw2217 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [07:16:24] RECOVERY - Check size of conntrack table on mw2221 is OK: OK: nf_conntrack is 0 % full [07:16:24] RECOVERY - Disk space on mw2217 is OK: DISK OK [07:16:42] RECOVERY - Disk space on lithium is OK: DISK OK [07:16:43] RECOVERY - MD RAID on mw2217 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [07:16:52] RECOVERY - nutcracker process on mw2217 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [07:17:32] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 639 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5596099 keys - replication_delay is 639 [07:18:13] RECOVERY - DPKG on mw2217 is OK: All packages OK [07:19:12] PROBLEM - puppet last run on mw2217 is CRITICAL: CRITICAL: puppet fail [07:19:44] PROBLEM - puppet last run on mw2221 is CRITICAL: CRITICAL: puppet fail [07:20:02] (03PS3) 10JanZerebecki: Allow firejail --join=* for sc-admin [puppet] - 10https://gerrit.wikimedia.org/r/293510 (https://phabricator.wikimedia.org/T137412) [07:21:33] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5561419 keys - replication_delay is 0 [07:21:52] (03CR) 10JanZerebecki: "Fixed the typo." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293510 (https://phabricator.wikimedia.org/T137412) (owner: 10JanZerebecki) [07:21:53] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time [07:28:03] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.596 second response time [07:30:04] I found the following for --^ [07:30:05] Jun 10 06:25:31 labmon1001 uwsgi-graphite-web[52949]: IOError: write error [07:30:42] Jun 10 06:25:31 labmon1001 uwsgi-graphite-web[52949]: Fri Jun 10 06:25:31 2016 - SIGPIPE: writing to a closed pipe/socket/fd (probably the client disconnected) on request /render?format=json&from=-10min&target=tools.tools-webgrid-lighttpd-1408.puppetagent.time_since_last_run (ip 10.68.16.210) [07:30:54] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:35:42] checking other logs [07:35:58] PROBLEM - DPKG on mw2218 is CRITICAL: Connection refused by host [07:36:27] PROBLEM - Disk space on mw2218 is CRITICAL: Connection refused by host [07:36:39] PROBLEM - MD RAID on mw2218 is CRITICAL: Connection refused by host [07:37:28] PROBLEM - configured eth on mw2218 is CRITICAL: Connection refused by host [07:37:48] PROBLEM - dhclient process on mw2218 is CRITICAL: Connection refused by host [07:38:08] PROBLEM - mediawiki-installation DSH group on mw2218 is CRITICAL: Host mw2218 is not in mediawiki-installation dsh group [07:38:27] PROBLEM - nutcracker port on mw2218 is CRITICAL: Connection refused by host [07:38:38] PROBLEM - nutcracker process on mw2218 is CRITICAL: Connection refused by host [07:38:57] PROBLEM - puppet last run on mw2218 is CRITICAL: Connection refused by host [07:39:15] new hosts ---^ [07:39:17] PROBLEM - salt-minion processes on mw2218 is CRITICAL: Connection refused by host [07:39:48] PROBLEM - Check size of conntrack table on mw2218 is CRITICAL: Connection refused by host [07:43:08] PROBLEM - Apache HTTP on 
mw2217 is CRITICAL: Connection refused [07:45:07] PROBLEM - NTP on mw1265 is CRITICAL: NTP CRITICAL: No response from NTP server [07:45:48] PROBLEM - Apache HTTP on mw2218 is CRITICAL: Connection timed out [07:46:57] PROBLEM - Apache HTTP on mw2221 is CRITICAL: Connection refused [07:47:17] PROBLEM - NTP on mw1267 is CRITICAL: NTP CRITICAL: No response from NTP server [07:48:08] RECOVERY - Apache HTTP on mw2220 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.074 second response time [07:48:43] jynus: hey, is it possible to merge this patch? https://gerrit.wikimedia.org/r/#/c/292516/ It needed a deployment of ores which I'm doing it right now [07:48:48] or anyone in ops [07:48:54] it would be great, thanks [07:48:58] PROBLEM - NTP on mw2220 is CRITICAL: NTP CRITICAL: Offset unknown [07:49:19] RECOVERY - Apache HTTP on mw1265 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.003 second response time [07:49:28] RECOVERY - configured eth on mw2220 is OK: OK - interfaces up [07:49:48] RECOVERY - dhclient process on mw2220 is OK: PROCS OK: 0 processes with command name dhclient [07:50:09] PROBLEM - NTP on mw2222 is CRITICAL: NTP CRITICAL: No response from NTP server [07:50:18] RECOVERY - nutcracker port on mw2220 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [07:50:27] RECOVERY - Check size of conntrack table on mw2220 is OK: OK: nf_conntrack is 0 % full [07:50:38] RECOVERY - Disk space on mw2220 is OK: DISK OK [07:50:39] RECOVERY - salt-minion processes on mw2220 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:51:08] RECOVERY - nutcracker process on mw2220 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [07:51:08] RECOVERY - MD RAID on mw2220 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [07:51:58] RECOVERY - DPKG on mw2220 is OK: All packages OK [07:52:58] RECOVERY - dhclient process on mw1265 is OK: PROCS OK: 0 processes with command name dhclient [07:53:18] RECOVERY - nutcracker port on mw1265 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [07:53:18] RECOVERY - Apache HTTP on mw2222 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.075 second response time [07:53:29] RECOVERY - salt-minion processes on mw1265 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:53:38] RECOVERY - Check size of conntrack table on mw1265 is OK: OK: nf_conntrack is 0 % full [07:53:49] RECOVERY - Disk space on mw1265 is OK: DISK OK [07:53:59] RECOVERY - nutcracker process on mw1265 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [07:54:17] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:54:28] RECOVERY - MD RAID on mw1265 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [07:54:29] RECOVERY - Apache HTTP on mw1267 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.008 second response time [07:54:48] RECOVERY - configured eth on mw1265 is OK: OK - interfaces up [07:55:07] RECOVERY - dhclient process on mw2222 is OK: PROCS OK: 0 processes with command name dhclient [07:55:08] RECOVERY - nutcracker process on mw2222 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [07:55:09] RECOVERY - MD RAID on mw2222 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [07:55:38] RECOVERY - Check size of conntrack table on mw2222 is OK: OK: nf_conntrack is 0 % full [07:55:58] RECOVERY - salt-minion processes on mw2222 is OK: PROCS OK: 1 process with 
regex args ^/usr/bin/python /usr/bin/salt-minion [07:56:08] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 1.123 second response time [07:56:27] RECOVERY - nutcracker port on mw2222 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [07:56:49] RECOVERY - Disk space on mw2222 is OK: DISK OK [07:56:49] RECOVERY - configured eth on mw2222 is OK: OK - interfaces up [07:57:28] RECOVERY - NTP on mw1267 is OK: NTP OK: Offset 0.07312917709 secs [07:57:59] RECOVERY - DPKG on mw1265 is OK: All packages OK [07:58:08] RECOVERY - DPKG on mw2222 is OK: All packages OK [07:58:48] RECOVERY - dhclient process on mw1267 is OK: PROCS OK: 0 processes with command name dhclient [07:58:58] RECOVERY - salt-minion processes on mw1267 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:58:58] RECOVERY - Disk space on mw1267 is OK: DISK OK [07:58:58] !log refilling ttmserver index on all ttm enabled wikis [07:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:59:17] RECOVERY - MD RAID on mw1267 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [07:59:18] RECOVERY - DPKG on mw1267 is OK: All packages OK [07:59:58] RECOVERY - nutcracker port on mw1267 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [08:00:07] RECOVERY - configured eth on mw1267 is OK: OK - interfaces up [08:00:17] RECOVERY - Check size of conntrack table on mw1267 is OK: OK: nf_conntrack is 0 % full [08:00:18] RECOVERY - nutcracker process on mw1267 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [08:01:29] !log deploying 38df031 into scb100[12] for ores service. Expecting some down time [08:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:06:15] PROBLEM - mediawiki-installation DSH group on mw2223 is CRITICAL: Host mw2223 is not in mediawiki-installation dsh group [08:06:36] PROBLEM - nutcracker port on mw2223 is CRITICAL: Connection refused by host [08:06:45] RECOVERY - NTP on mw2220 is OK: NTP OK: Offset -0.004486083984 secs [08:07:05] PROBLEM - nutcracker process on mw2223 is CRITICAL: Connection refused by host [08:07:16] PROBLEM - puppet last run on mw2223 is CRITICAL: Connection refused by host [08:07:35] RECOVERY - NTP on mw1265 is OK: NTP OK: Offset -0.001380324364 secs [08:07:35] PROBLEM - salt-minion processes on mw2223 is CRITICAL: Connection refused by host [08:07:47] RECOVERY - Apache HTTP on mw2218 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.074 second response time [08:08:15] PROBLEM - Check size of conntrack table on mw2223 is CRITICAL: Connection refused by host [08:08:26] PROBLEM - DPKG on mw2223 is CRITICAL: Connection refused by host [08:08:45] PROBLEM - Disk space on mw2223 is CRITICAL: Connection refused by host [08:09:05] PROBLEM - MD RAID on mw2223 is CRITICAL: Connection refused by host [08:09:36] PROBLEM - ores on scb1001 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 136 bytes in 0.006 second response time [08:09:56] PROBLEM - configured eth on mw2223 is CRITICAL: Connection refused by host [08:10:05] PROBLEM - NTP on mw2218 is CRITICAL: NTP CRITICAL: No response from NTP server [08:10:15] PROBLEM - dhclient process on mw2223 is CRITICAL: Connection refused by host [08:10:16] PROBLEM - ores on scb2001 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 136 bytes in 0.074 second response time [08:11:04] 06Operations, 10Ops-Access-Requests, 
10LDAP-Access-Requests, 07LDAP: add Ladsgroup to nda LDAP group (was: Grant graphite.wikimedia.org rights to grafana-admin LDAP group) - https://phabricator.wikimedia.org/T137373#2370833 (10jcrespo) Why is graphite access needed? [08:11:05] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - ores_8081 - Could not depool server scb1002.eqiad.wmnet because of too many down! [08:11:11] PROBLEM - LVS HTTP IPv4 on ores.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 136 bytes in 0.007 second response time [08:11:26] PROBLEM - Apache HTTP on mw2220 is CRITICAL: Connection refused [08:11:35] PROBLEM - ores on scb1002 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 136 bytes in 0.001 second response time [08:11:36] PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - ores_8081 - Could not depool server scb1002.eqiad.wmnet because of too many down! [08:11:47] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - ores_8081 - Could not depool server scb1002.eqiad.wmnet because of too many down! [08:12:00] ouch ores seems down [08:12:05] RECOVERY - NTP on mw2222 is OK: NTP OK: Offset -0.00839650631 secs [08:12:23] uhm, what's up? [08:12:36] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [08:12:38] Amir1: are you deploying ores? [08:13:15] akosiaris: here? [08:13:25] elukey: yup [08:13:30] I'm fixing it [08:13:34] super :) [08:13:35] thanks [08:13:39] that's a known issue [08:13:48] https://phabricator.wikimedia.org/T137524 [08:13:51] what is? [08:13:56] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [1000.0] [08:14:03] every time we deploy, we have a down time [08:14:11] until puppet makes the config files again [08:14:53] then let's not deploy until this gets fixed [08:15:07] 5xx are all for ores [08:15:10] 06Operations, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium): Port Zuul package 2.1.0-95-g66c8e52 from Precise to Jessie - https://phabricator.wikimedia.org/T137279#2370842 (10hashar) [08:15:16] PROBLEM - puppet last run on mw1265 is CRITICAL: CRITICAL: Puppet has 2 failures [08:15:29] 06Operations, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium): Port Zuul package 2.1.0-95-g66c8e52 from Precise to Jessie - https://phabricator.wikimedia.org/T137279#2363409 (10hashar) Pending deployment. [08:15:37] the mw criticals are me and papaul reimaging new appservers [08:16:15] PROBLEM - Apache HTTP on mw2223 is CRITICAL: Connection timed out [08:16:34] but ores is creating 6860 5xx/minute, that is a lot [08:16:45] we have a script somewhere that sets the host in scheduled downtime [08:17:11] Amir1: status? [08:17:20] paravoid: i was about to ask, didn't know if there was a way to silence hosts not in icinga [08:17:22] they are "500" [08:17:31] working on it [08:17:47] (i mean, there are now in there but not before the reimage) [08:17:48] paravoid: can you run the puppet agent manually in scb100[12]? 
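What is being asked of paravoid here, per the rest of the exchange below: run puppet on scb100[12] so it regenerates the ores config that the deploy clobbered, then restart the two services Amir1 names. A sketch; the unit names (referred to in the channel variously as ores-uwsgi/uwsgi-ores and celery-ores-worker) and the health-check port are taken from the chat and the pybal alert, not verified:

```
sudo puppet agent -t                                   # regenerate the ores config files
sudo systemctl restart uwsgi-ores celery-ores-worker   # unit names as mentioned in the channel
curl -s http://localhost:8081/ | head -c 200           # 8081 is the port in the ores_8081 pybal pool
```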
[08:18:06] PROBLEM - puppet last run on mw1267 is CRITICAL: CRITICAL: Puppet has 2 failures [08:18:06] PROBLEM - Apache HTTP on mw1265 is CRITICAL: Connection refused [08:18:14] Amir1: doing it [08:18:29] tbab [08:18:29] thanks [08:18:36] thanks elukey :) [08:18:48] this should also fix it [08:18:49] https://gerrit.wikimedia.org/r/#/c/292516/ [08:18:56] the long term solution :) [08:19:17] we came to this conclusion after lots of talks [08:19:56] PROBLEM - Apache HTTP on mw1267 is CRITICAL: Connection refused [08:20:02] --^ me [08:20:09] be back in a few mintues [08:20:15] I need to catch my plane [08:20:15] Amir1: done! [08:20:20] thanks! [08:20:34] still 500 [08:20:49] the labs setup works perfectly fine [08:21:43] 06Operations, 10DBA, 10Wikidata, 07Performance: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow - https://phabricator.wikimedia.org/T116404#2370849 (10jcrespo) This is not a problem with the servers, the query planner, or the indexing: ``` MariaDB... [08:21:49] should it need a restart by any chance/ [08:21:49] ? [08:22:03] I saw a celery refresh after puppet run [08:22:08] but not familiar with ores :) [08:22:40] we have a script somewhere that sets the host in scheduled downtime let me try that [08:23:46] elukey: can you do it [08:23:49] p858snake|L_: ah nice! But does it work also for hosts not in icinga yet? [08:23:50] two services [08:23:53] ores-uwsgi [08:23:54] sure [08:24:04] and celery-ores-worker [08:24:11] I think they are down [08:24:11] maybe I'll try with ores-uwsgi, the workers should be fine [08:24:28] be back in a few mins [08:24:30] * Amir1 runs [08:24:33] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Migrate CI services from gallium to contint1001 - https://phabricator.wikimedia.org/T137358#2370858 (10hashar) [08:25:00] 06Operations, 10Traffic, 10Continuous-Integration-Infrastructure (phase-out-gallium): Move gallium to an internal host? - https://phabricator.wikimedia.org/T133150#2370859 (10hashar) [08:25:36] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [08:25:40] elukey: ask the person that wrote it? But i would assume icinga would need to know about a box before it can mark it down [08:25:41] RECOVERY - LVS HTTP IPv4 on ores.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2644 bytes in 0.010 second response time [08:25:55] 06Operations, 10Traffic, 10Continuous-Integration-Infrastructure (phase-out-gallium): Move gallium to an internal host? - https://phabricator.wikimedia.org/T133150#2223557 (10hashar) [08:25:57] RECOVERY - ores on scb1002 is OK: HTTP OK: HTTP/1.0 200 OK - 2833 bytes in 0.016 second response time [08:26:03] \o/ [08:26:06] RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy [08:26:15] RECOVERY - ores on scb1001 is OK: HTTP OK: HTTP/1.0 200 OK - 2833 bytes in 0.008 second response time [08:26:20] 06Operations, 10Traffic, 10Continuous-Integration-Infrastructure (phase-out-gallium): Move gallium to an internal host? 
- https://phabricator.wikimedia.org/T133150#2223557 (10hashar) [08:26:25] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [08:26:27] p858snake|L_: thanks, I know that guy :) [08:27:28] !log restarted uwsgi-ores (after a deployment + puppet run) - service was down [08:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:28:11] elukey: thanks elukey [08:28:29] the good news, I won't miss the flight [08:28:31] anyway [08:28:54] have a good flight :) [08:28:59] were you rushing the deploy because you were going to miss a flight? [08:29:08] oh no [08:29:19] I wanted to depoy [08:29:30] but it took much more than what I anticipated [08:29:47] I would suggest no deploys on fridays [08:30:04] also not before catching planes :P [08:30:13] I had 2.5 hours [08:30:20] Amir1: you were not supposed to deploy until https://gerrit.wikimedia.org/r/#/c/292516/ was merged [08:30:24] that should be enough [08:30:40] akosiaris: actually the plan was the other way around [08:30:54] nope [08:30:56] I talked to you, that's why you +1 it and not +2 it [08:31:11] I was waiting for you to be ready so we can coordinate it [08:31:12] or that was my impression :) [08:31:33] next time I will be noting it in the change [08:31:47] sorry for misunderstanding [08:31:59] I think he should get some context [08:32:04] that he may be missing [08:32:12] jynus: yup, I agree [08:32:20] Amir1, you sent 2 sms to all ops [08:32:32] oh, I'm soo sorry for that [08:32:48] I think we need to fix icinga [08:32:59] fix what ? [08:33:14] for now, because we are still testing with ores.wm.or and not used it anywhere [08:33:31] but I think tests will be finished soon [08:33:32] it is used in fawiki IIRC [08:33:41] ? [08:33:44] and that was for the test [08:33:49] not the extension [08:33:55] only one gadget [08:34:01] has fawiki created tables on production, Amir1 ? [08:34:02] well, they still are 64 end users [08:34:08] jynus: no no [08:34:12] jynus: no, it's gadgets [08:34:16] it's about something else [08:34:22] ah, because that really needs to be blocked by me [08:34:34] jynus: yeah, it would be great if you review it [08:35:23] akosiaris: yeah, it was a gradual deployment [08:35:31] it won't happen again [08:35:46] so, in conclustion, either let's not do deployments in the current state, or not without supervision [08:36:46] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 07LDAP: add Ladsgroup to nda LDAP group (was: Grant graphite.wikimedia.org rights to grafana-admin LDAP group) - https://phabricator.wikimedia.org/T137373#2370886 (10Ladsgroup) to build new dashboards I need to know what metrics they are sending... [08:36:57] jynus: sure [08:36:59] point taken [08:37:01] o/ [08:37:06] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:37:46] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:42:32] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 07LDAP: add Ladsgroup to nda LDAP group (was: Grant graphite.wikimedia.org rights to grafana-admin LDAP group) - https://phabricator.wikimedia.org/T137373#2370908 (10JanZerebecki) I'm sorry I skipped that question. 
[08:43:15] PROBLEM - Apache HTTP on mw2222 is CRITICAL: Connection refused [08:45:15] PROBLEM - NTP on mw2223 is CRITICAL: NTP CRITICAL: No response from NTP server [08:47:02] (03CR) 10Hashar: [C: 04-1] "Would cause puppet to fails on Precise instances which we still uses. Easiest way would be to harness with os_version() , else we can pr" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293273 (https://phabricator.wikimedia.org/T133779) (owner: 10Ottomata) [08:47:54] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 07LDAP: add Ladsgroup to nda LDAP group (was: Grant graphite.wikimedia.org rights to grafana-admin LDAP group) - https://phabricator.wikimedia.org/T137373#2370917 (10jcrespo) > I need to know what metrics they are sending it and it's really hard... [08:48:55] PROBLEM - Disk space on ms-be2012 is CRITICAL: DISK CRITICAL - free space: / 2070 MB (3% inode=96%) [08:55:08] (03PS1) 10Muehlenhoff: Enable backup for gallium [puppet] - 10https://gerrit.wikimedia.org/r/293690 (https://phabricator.wikimedia.org/T80385) [08:55:43] (03Abandoned) 10Muehlenhoff: Enable backup for contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/293295 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff) [08:58:39] 06Operations, 06Labs, 10Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2370970 (10MoritzMuehlenhoff) I have tested the 2.4.41 packages in vagrant with a syncrepl setup and seems fine. Update will happen next week, not really something for a Friday... [09:00:25] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:03:14] RECOVERY - Apache HTTP on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.061 second response time [09:03:35] RECOVERY - Apache HTTP on mw2223 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.075 second response time [09:03:45] RECOVERY - Apache HTTP on mw1267 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.074 second response time [09:03:57] (03PS2) 10Ema: varnishlog4: default to request grouping [puppet] - 10https://gerrit.wikimedia.org/r/293530 (https://phabricator.wikimedia.org/T137114) [09:04:14] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.012 second response time [09:04:34] (03CR) 10Ema: [C: 032 V: 032] varnishlog4: default to request grouping [puppet] - 10https://gerrit.wikimedia.org/r/293530 (https://phabricator.wikimedia.org/T137114) (owner: 10Ema) [09:04:45] PROBLEM - Apache HTTP on mw1268 is CRITICAL: Connection timed out [09:05:02] --^ me bootstrapping [09:05:06] PROBLEM - mediawiki-installation DSH group on mw1268 is CRITICAL: Host mw1268 is not in mediawiki-installation dsh group [09:05:14] RECOVERY - NTP on mw2218 is OK: NTP OK: Offset -0.07374298573 secs [09:05:20] (03PS3) 10Ema: varnishlog4.py: log errors in execute() [puppet] - 10https://gerrit.wikimedia.org/r/293123 (https://phabricator.wikimedia.org/T137114) [09:05:25] (03CR) 10Hashar: [C: 031] "Neat thanks! 
I have no idea what is going to be the impact of saving the thousands and thousands of Jenkins files under /var/lib/jenkins/ " [puppet] - 10https://gerrit.wikimedia.org/r/293690 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff) [09:05:35] (03CR) 10Ema: [C: 032 V: 032] varnishlog4.py: log errors in execute() [puppet] - 10https://gerrit.wikimedia.org/r/293123 (https://phabricator.wikimedia.org/T137114) (owner: 10Ema) [09:05:36] RECOVERY - DPKG on mw2223 is OK: All packages OK [09:05:36] RECOVERY - nutcracker process on mw2223 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [09:05:44] PROBLEM - nutcracker port on mw1268 is CRITICAL: Timeout while attempting connection [09:05:54] RECOVERY - Disk space on mw2223 is OK: DISK OK [09:06:04] PROBLEM - nutcracker process on mw1268 is CRITICAL: Timeout while attempting connection [09:06:14] RECOVERY - nutcracker port on mw2223 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [09:06:15] RECOVERY - MD RAID on mw2223 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [09:06:15] RECOVERY - salt-minion processes on mw2223 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:06:24] PROBLEM - puppet last run on mw1268 is CRITICAL: Timeout while attempting connection [09:06:44] PROBLEM - salt-minion processes on mw1268 is CRITICAL: Timeout while attempting connection [09:06:54] RECOVERY - dhclient process on mw2223 is OK: PROCS OK: 0 processes with command name dhclient [09:06:54] RECOVERY - Check size of conntrack table on mw2223 is OK: OK: nf_conntrack is 0 % full [09:07:05] RECOVERY - configured eth on mw2223 is OK: OK - interfaces up [09:07:15] PROBLEM - Check size of conntrack table on mw1268 is CRITICAL: Timeout while attempting connection [09:07:34] PROBLEM - DPKG on mw1268 is CRITICAL: Timeout while attempting connection [09:07:54] PROBLEM - Disk space on mw1268 is CRITICAL: Timeout while attempting connection [09:08:06] PROBLEM - MD RAID on mw1268 is CRITICAL: Timeout while attempting connection [09:08:15] 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2371001 (10Gehel) some reverse engineering has already been done by @jcrespo, documented on [[ https://wikitech.wikimedia.org/wiki/MariaDB/Sanitarium_and_Labsdbs... [09:08:55] PROBLEM - configured eth on mw1268 is CRITICAL: Timeout while attempting connection [09:09:14] PROBLEM - dhclient process on mw1268 is CRITICAL: Timeout while attempting connection [09:10:59] (03PS1) 10Elukey: Add new appservers mw126[57] to the Mediawiki DSH scap list. [puppet] - 10https://gerrit.wikimedia.org/r/293692 [09:11:49] (03CR) 10Elukey: [C: 032] Add new appservers mw126[57] to the Mediawiki DSH scap list. 
[puppet] - 10https://gerrit.wikimedia.org/r/293692 (owner: 10Elukey) [09:19:51] (03CR) 10Alexandros Kosiaris: [C: 032] ores: move config file to /etc/ores [puppet] - 10https://gerrit.wikimedia.org/r/292516 (owner: 10Ladsgroup) [09:19:56] (03PS2) 10Alexandros Kosiaris: ores: move config file to /etc/ores [puppet] - 10https://gerrit.wikimedia.org/r/292516 (owner: 10Ladsgroup) [09:20:02] (03CR) 10Alexandros Kosiaris: [V: 032] ores: move config file to /etc/ores [puppet] - 10https://gerrit.wikimedia.org/r/292516 (owner: 10Ladsgroup) [09:20:55] RECOVERY - ores on scb2001 is OK: HTTP OK: HTTP/1.0 200 OK - 2833 bytes in 0.093 second response time [09:22:04] !log restarted uwsgi-ores on scb200[12] as deployment follow up [09:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:22:55] RECOVERY - NTP on mw2223 is OK: NTP OK: Offset -0.007131099701 secs [09:24:28] (03CR) 10Muehlenhoff: [C: 031] "+1 on the patch itself, but needs approval from ops meeting" [puppet] - 10https://gerrit.wikimedia.org/r/293510 (https://phabricator.wikimedia.org/T137412) (owner: 10JanZerebecki) [09:25:16] (03PS1) 10Alexandros Kosiaris: service::uwsgi: Ensure config directory exists [puppet] - 10https://gerrit.wikimedia.org/r/293694 [09:26:32] (03CR) 10jenkins-bot: [V: 04-1] service::uwsgi: Ensure config directory exists [puppet] - 10https://gerrit.wikimedia.org/r/293694 (owner: 10Alexandros Kosiaris) [09:29:03] (03PS2) 10Alexandros Kosiaris: service::uwsgi: Ensure config directory exists [puppet] - 10https://gerrit.wikimedia.org/r/293694 [09:30:17] (03CR) 10jenkins-bot: [V: 04-1] service::uwsgi: Ensure config directory exists [puppet] - 10https://gerrit.wikimedia.org/r/293694 (owner: 10Alexandros Kosiaris) [09:35:33] 06Operations, 10DBA, 06Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2371014 (10jcrespo) It is scheduled. It is difficult to give an estimation, but it can be done after enwiki is finished, so 3-6 months? Labs hosts, by its own nature cannot and will probably not be 100% ever... 
[09:36:46] RECOVERY - salt-minion processes on mw2218 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:37:05] RECOVERY - MD RAID on mw2218 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [09:37:16] RECOVERY - Disk space on mw2218 is OK: DISK OK [09:37:45] RECOVERY - nutcracker process on mw2218 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [09:37:46] RECOVERY - configured eth on mw2218 is OK: OK - interfaces up [09:37:55] RECOVERY - Check size of conntrack table on mw2218 is OK: OK: nf_conntrack is 0 % full [09:38:06] RECOVERY - dhclient process on mw2218 is OK: PROCS OK: 0 processes with command name dhclient [09:38:26] RECOVERY - nutcracker port on mw2218 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [09:40:36] RECOVERY - Apache HTTP on mw1268 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.019 second response time [09:42:36] RECOVERY - DPKG on mw2218 is OK: All packages OK [09:43:46] RECOVERY - DPKG on mw1268 is OK: All packages OK [09:44:06] RECOVERY - Disk space on mw1268 is OK: DISK OK [09:44:26] RECOVERY - MD RAID on mw1268 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [09:44:26] RECOVERY - salt-minion processes on mw1268 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:44:45] PROBLEM - NTP on mw1268 is CRITICAL: NTP CRITICAL: Offset unknown [09:44:56] RECOVERY - dhclient process on mw1268 is OK: PROCS OK: 0 processes with command name dhclient [09:45:06] RECOVERY - configured eth on mw1268 is OK: OK - interfaces up [09:45:15] RECOVERY - nutcracker port on mw1268 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [09:45:27] RECOVERY - nutcracker process on mw1268 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [09:45:45] RECOVERY - Check size of conntrack table on mw1268 is OK: OK: nf_conntrack is 0 % full [09:53:26] PROBLEM - Apache HTTP on mw2223 is CRITICAL: Connection refused [09:54:36] RECOVERY - Disk space on ms-be2012 is OK: DISK OK [09:55:08] 06Operations, 10DBA, 10Wikidata, 07Performance: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow - https://phabricator.wikimedia.org/T116404#2371028 (10daniel) @jcrespo @hoo: ick, 14 million rows? And this isn't optimized away because of the DISTINC... [09:56:11] 06Operations: ms-be2012 ran out of disk space - https://phabricator.wikimedia.org/T137397#2367102 (10jcrespo) I've been routinely archiving to /srv syslog.1 to temporarily fix the disk issues. This has been logged several times: ``` 08:38 jynus: archiving again syslog.1 from ms-be2012 on /srv/swift-storage/sd... 
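The archiving jcrespo describes above for ms-be2012 is a stopgap: park the already-rotated syslog on one of the large swift data partitions so the root filesystem stops filling up. A minimal sketch of that workaround (the exact target directory is truncated in the log, so sdx1 below is only a placeholder; compressing afterwards is optional):

  # move the rotated log off / onto a big swift partition, then compress it
  # (sdx1 is a placeholder; the real partition name is truncated in the log)
  sudo mv /var/log/syslog.1 /srv/swift-storage/sdx1/syslog.1.$(date +%F)
  sudo gzip /srv/swift-storage/sdx1/syslog.1.$(date +%F)
  # confirm the root filesystem has breathing room again
  df -h /

This only buys time; the volume of logging on the host still needs to be addressed separately.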
[09:58:27] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.006 second response time [09:59:15] RECOVERY - NTP on mw1268 is OK: NTP OK: Offset 0.003390908241 secs [10:00:26] PROBLEM - puppet last run on mw1268 is CRITICAL: CRITICAL: Puppet has 2 failures [10:00:45] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.223 second response time [10:05:10] !log scb100x disabling puppet and stopping change-prop to look at zookeeper znodes [10:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:06:40] PROBLEM - Apache HTTP on mw1268 is CRITICAL: Connection refused [10:12:09] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [10:14:29] RECOVERY - mediawiki-installation DSH group on mw1267 is OK: OK [10:15:24] mw127[01] are bootstraping atm [10:15:30] RECOVERY - mediawiki-installation DSH group on mw1265 is OK: OK [10:15:33] *strapping [10:20:52] 06Operations, 10DBA, 10Wikidata, 07Performance: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow - https://phabricator.wikimedia.org/T116404#2371094 (10jcrespo) @daniel I think this is a case of [[ http://ubiquity.acm.org/article.cfm?id=1513451 | pr... [10:21:20] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [10:22:13] (03PS1) 10Gergő Tisza: Fix logging config for authmanager metrics channel rename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293701 [10:23:20] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:24:20] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [10:25:21] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 6.654 second response time [10:39:31] (03PS1) 10Ema: varnish{xcache,xcps,...}: subscribe to varnishlog.py [puppet] - 10https://gerrit.wikimedia.org/r/293705 [10:43:04] deploying a fix for AuthManager metrics [10:45:51] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.218 second response time [10:47:59] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.016 second response time [10:48:10] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:49:00] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [10:51:32] 06Operations, 10DBA, 06Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2371192 (10Blahma) Thank you. I did not realize there were also SQL dumps, not only XML. Would it perhaps be possible to have the latest dump readily available on an SQL server? That could be an alternative fo... [10:51:55] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2371193 (10jcrespo) This is ok, just ping me today early or next week to get it done. 
[10:56:34] !log scb100x enabled puppet back [10:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:57:19] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [11:03:08] 06Operations, 10DBA, 10Wikidata, 07Performance: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow - https://phabricator.wikimedia.org/T116404#2371207 (10daniel) @jcrespo so we should do one query per ID, with limit 1? ok! [11:03:40] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:06:51] !log tgr@tin Synchronized php-1.28.0-wmf.5/includes/specials/SpecialCreateAccount.php: deploy [[gerrit:293704]] to fix AuthManager metrics (duration: 00m 52s) [11:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:06:57] (03CR) 10Hashar: "/ shows up 250GBytes being used, most of it would be the Jenkins job build history and console logs under /var/lib/jenkins/jobs/*/builds. " [puppet] - 10https://gerrit.wikimedia.org/r/293690 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff) [11:07:44] !log tgr@tin Synchronized php-1.28.0-wmf.5/includes/specials/SpecialUserLogin.php: deploy [[gerrit:293704]] to fix AuthManager metrics (duration: 00m 32s) [11:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:08:42] (03CR) 10BBlack: [C: 031] varnish{xcache,xcps,...}: subscribe to varnishlog.py [puppet] - 10https://gerrit.wikimedia.org/r/293705 (owner: 10Ema) [11:10:11] RECOVERY - Apache HTTP on mw1268 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.062 second response time [11:11:40] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 3.565 second response time [11:14:07] akosiaris: do we have separate repo for Beta to test? [11:14:20] RECOVERY - Apache HTTP on mw2217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.723 second response time [11:14:28] Specially Jessie migration of Apertium need some more testing. [11:17:39] 06Operations, 10DBA, 06Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2371239 (10jcrespo) @Blahma what do you want me to do? I can load those files, but those would be out of sync as soon as they are imported, and impossible to get updated. You can load those tables to the same... 
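On the gallium backup question above, Hashar's 250 GB figure is dominated by per-job build history, so a quick size survey is the obvious first step before pointing Bacula at the host. A sketch using only the paths he mentions (nothing WMF-specific assumed beyond those paths):

  # total size of the Jenkins home, then the largest per-job build histories
  sudo du -sh /var/lib/jenkins
  sudo sh -c 'du -sh /var/lib/jenkins/jobs/*/builds' | sort -h | tail -n 20

That would show whether excluding builds/ (or just the console logs) keeps the backup set manageable.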
[11:17:40] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.005 second response time [11:23:21] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.733 second response time [11:23:50] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 2.612 second response time [11:27:40] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 16.611 second response time [11:32:30] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.007 second response time [11:33:50] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.019 second response time [11:41:37] (03PS1) 10Muehlenhoff: Install firejail profile for convert [puppet] - 10https://gerrit.wikimedia.org/r/293710 (https://phabricator.wikimedia.org/T135111) [11:46:00] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.839 second response time [11:48:08] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.528 second response time [11:50:27] (03PS1) 10Muehlenhoff: Enable base::firewall for labtestmetal2001 [puppet] - 10https://gerrit.wikimedia.org/r/293712 [11:52:09] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [12:04:15] (03PS1) 10Elukey: Add new appserver mw1268 to the Mediawiki DSH scap list. [puppet] - 10https://gerrit.wikimedia.org/r/293715 [12:05:33] 06Operations, 10DBA, 06Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2371339 (10Blahma) @jcrespo Thanks for staying on the constructive line. FYI, the output in question is https://cs.wikipedia.org/wiki/Wikipedie:%C3%9Adr%C5%BEba/Nekategorizovan%C3%A9_%C4%8Dl%C3%A1nky_s_ohledem... [12:07:25] (03CR) 10Elukey: [C: 032] Add new appserver mw1268 to the Mediawiki DSH scap list. [puppet] - 10https://gerrit.wikimedia.org/r/293715 (owner: 10Elukey) [12:19:09] (03PS1) 10Muehlenhoff: Enable base::firewall for osmium [puppet] - 10https://gerrit.wikimedia.org/r/293717 [12:25:07] (03PS2) 10Muehlenhoff: Install firejail profile for convert [puppet] - 10https://gerrit.wikimedia.org/r/293710 (https://phabricator.wikimedia.org/T135111) [12:26:32] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 07LDAP: add Ladsgroup to nda LDAP group (was: Grant graphite.wikimedia.org rights to grafana-admin LDAP group) - https://phabricator.wikimedia.org/T137373#2366461 (10Krenair) (Graphite isn't security-reviewed which is why it currently has the NDA... 
[12:31:16] (03CR) 10Muehlenhoff: [C: 032 V: 032] Install firejail profile for convert [puppet] - 10https://gerrit.wikimedia.org/r/293710 (https://phabricator.wikimedia.org/T135111) (owner: 10Muehlenhoff) [12:31:27] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:33:29] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 5.177 second response time [12:41:08] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [12:46:09] 06Operations, 10DBA, 06Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2371458 (10jcrespo) This is very easy to fix- tell your users to mark those that are incorrect, and exclude them from your query- that is very easy to do and doesn't require waiting. [12:47:28] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:49:23] "Error: Could not find any host matching 'mw1270'" on icinga [12:49:29] probably a race condition [12:51:17] jynus: I had to manually add rootdelay to boot mw1270 because wmf-reimage was stuck [12:51:43] same thing for mw1269 and mw1271 [12:52:42] meanwhile, logstash1001 /var/log/logstash full of "LogStash::Json::ParserError: Unexpected end-of-input in VALUE_STRING" [12:58:23] seems fixed now, but I cannot rotate the logs [12:58:33] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup payments1005-8 - https://phabricator.wikimedia.org/T136881#2371492 (10faidon) [13:01:55] RECOVERY - Disk space on logstash1001 is OK: DISK OK [13:02:24] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [13:05:14] PROBLEM - configured eth on mw1269 is CRITICAL: Timeout while attempting connection [13:05:14] PROBLEM - configured eth on mw1270 is CRITICAL: Timeout while attempting connection [13:05:14] PROBLEM - configured eth on mw1271 is CRITICAL: Timeout while attempting connection [13:05:34] PROBLEM - Apache HTTP on mw1271 is CRITICAL: Connection timed out [13:05:34] PROBLEM - Apache HTTP on mw1270 is CRITICAL: Connection timed out [13:05:34] PROBLEM - Apache HTTP on mw1269 is CRITICAL: Connection timed out [13:05:44] PROBLEM - dhclient process on mw1269 is CRITICAL: Timeout while attempting connection [13:05:44] PROBLEM - dhclient process on mw1271 is CRITICAL: Timeout while attempting connection [13:05:44] PROBLEM - dhclient process on mw1270 is CRITICAL: Timeout while attempting connection [13:05:55] PROBLEM - mediawiki-installation DSH group on mw1269 is CRITICAL: Host mw1269 is not in mediawiki-installation dsh group [13:05:55] PROBLEM - mediawiki-installation DSH group on mw1270 is CRITICAL: Host mw1270 is not in mediawiki-installation dsh group [13:05:55] PROBLEM - mediawiki-installation DSH group on mw1271 is CRITICAL: Host mw1271 is not in mediawiki-installation dsh group [13:06:24] PROBLEM - nutcracker port on mw1269 is CRITICAL: Timeout while attempting connection [13:06:24] PROBLEM - nutcracker port on mw1271 is CRITICAL: Timeout while attempting connection [13:06:24] PROBLEM - nutcracker port on mw1270 is CRITICAL: Timeout while attempting connection [13:06:34] PROBLEM - nutcracker process on mw1271 is CRITICAL: Timeout while attempting connection [13:06:34] PROBLEM - nutcracker process on mw1269 is CRITICAL: Timeout while attempting connection [13:06:34] PROBLEM - nutcracker process on mw1270 is CRITICAL: Timeout while attempting 
connection [13:06:45] PROBLEM - puppet last run on mw1269 is CRITICAL: Timeout while attempting connection [13:06:45] PROBLEM - puppet last run on mw1270 is CRITICAL: Timeout while attempting connection [13:06:45] PROBLEM - puppet last run on mw1271 is CRITICAL: Timeout while attempting connection [13:06:50] here they are, new appservers :) [13:07:15] PROBLEM - salt-minion processes on mw1270 is CRITICAL: Timeout while attempting connection [13:07:15] PROBLEM - salt-minion processes on mw1269 is CRITICAL: Timeout while attempting connection [13:07:15] PROBLEM - salt-minion processes on mw1271 is CRITICAL: Timeout while attempting connection [13:07:21] can we do something to reduce the alert spam though? :) [13:07:54] PROBLEM - Check size of conntrack table on mw1269 is CRITICAL: Timeout while attempting connection [13:07:54] PROBLEM - Check size of conntrack table on mw1270 is CRITICAL: Timeout while attempting connection [13:07:54] PROBLEM - Check size of conntrack table on mw1271 is CRITICAL: Timeout while attempting connection [13:08:04] PROBLEM - DPKG on mw1271 is CRITICAL: Timeout while attempting connection [13:08:04] PROBLEM - DPKG on mw1269 is CRITICAL: Timeout while attempting connection [13:08:04] PROBLEM - DPKG on mw1270 is CRITICAL: Timeout while attempting connection [13:08:24] PROBLEM - Disk space on mw1271 is CRITICAL: Timeout while attempting connection [13:08:24] PROBLEM - Disk space on mw1270 is CRITICAL: Timeout while attempting connection [13:08:24] PROBLEM - Disk space on mw1269 is CRITICAL: Timeout while attempting connection [13:08:35] PROBLEM - MD RAID on mw1269 is CRITICAL: Timeout while attempting connection [13:08:35] PROBLEM - MD RAID on mw1270 is CRITICAL: Timeout while attempting connection [13:08:35] PROBLEM - MD RAID on mw1271 is CRITICAL: Timeout while attempting connection [13:09:04] RECOVERY - mediawiki-installation DSH group on mw1268 is OK: OK [13:12:28] 06Operations, 10ops-codfw, 10DBA: db2034 crash - https://phabricator.wikimedia.org/T137084#2371534 (10jcrespo) [13:12:30] 06Operations, 10ops-codfw, 10DBA: db2034 degraded RAID - https://phabricator.wikimedia.org/T136583#2371536 (10jcrespo) [13:12:59] !log Testing patched Cassandra (dpkg -i ...; service cassandra-a restart) on xenon : T137474 [13:13:00] T137474: Investigate lack of recency bias in Cassandra histogram metrics - https://phabricator.wikimedia.org/T137474 [13:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:13:14] RECOVERY - puppet last run on logstash1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:15:02] !log Starting html dump(s) in RESTBase staging : T137474 [13:15:03] T137474: Investigate lack of recency bias in Cassandra histogram metrics - https://phabricator.wikimedia.org/T137474 [13:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:15:52] paravoid: as soon as they come up in icinga I can silence them to avoid any repetition, but not sure about how to remove spam completely :( [13:16:09] 06Operations, 10DBA, 06Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2371555 (10MZMcBride) >>! In T126946#2371458, @jcrespo wrote: > This is very easy to fix- tell your users to mark those that are incorrect, and exclude them from your query- that is very easy to do and doesn't... [13:16:14] what do you mean? 
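On the rootdelay workaround elukey mentions a few lines up for mw1269-71: the idea is to give the kernel extra time before it tries to mount the root device, presumably because the controller is slow to present the disks, which is enough to get past the point where wmf-reimage was stuck. A sketch of making it persistent (the 90-second value is an arbitrary example, not what was actually used on those hosts):

  # append rootdelay to the kernel command line, then regenerate the grub config
  # /etc/default/grub:  GRUB_CMDLINE_LINUX="... rootdelay=90"   (90 is an example value)
  sudo update-grub
  sudo reboot

For a one-off boot it can also just be typed onto the linux line at the GRUB menu.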
[13:17:22] I am not sure how to silence mw hosts that are not in icinga yet, but that appear only when they run puppet the first tine [13:17:25] *time [13:17:53] these aren't reimages, right? [13:18:43] yes, those are new new appservers [13:18:55] 06Operations, 10DBA, 06Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2371577 (10jcrespo) > The latest "solution" to the constant stream of data integrity issues on Wikimedia Labs database replicas is to further inconvenience volunteers? No, the latest solution is the reimport... [13:19:40] (I am working on the eqiad ones and papaul on the codfw ones) [13:24:37] (03CR) 10Gehel: "Puppet compiler output: https://puppet-compiler.wmflabs.org/3082/ this is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/293540 (owner: 10MaxSem) [13:25:01] (03PS1) 10BBlack: VCL: move fe 503-retry to top of v3 vcl_error [puppet] - 10https://gerrit.wikimedia.org/r/293720 [13:25:03] (03PS1) 10BBlack: X-Cache: fix missing "int" cases, add "err", "bug" [puppet] - 10https://gerrit.wikimedia.org/r/293721 [13:25:08] (03PS3) 10Gehel: Move everything Postgres-related out of role::maps::server [puppet] - 10https://gerrit.wikimedia.org/r/293540 (owner: 10MaxSem) [13:27:51] (03CR) 10Gehel: [C: 032] Move everything Postgres-related out of role::maps::server [puppet] - 10https://gerrit.wikimedia.org/r/293540 (owner: 10MaxSem) [13:30:05] RECOVERY - Apache HTTP on mw1270 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.008 second response time [13:31:26] (03PS3) 10Gehel: Configure proxy for HTTPS as well as HTTP in replicate-osm. [puppet] - 10https://gerrit.wikimedia.org/r/293477 (https://phabricator.wikimedia.org/T134901) [13:32:51] (03CR) 10Gehel: [C: 032] Configure proxy for HTTPS as well as HTTP in replicate-osm. 
[puppet] - 10https://gerrit.wikimedia.org/r/293477 (https://phabricator.wikimedia.org/T134901) (owner: 10Gehel) [13:33:21] RECOVERY - salt-minion processes on mw1270 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:33:21] RECOVERY - DPKG on mw1270 is OK: All packages OK [13:33:51] RECOVERY - configured eth on mw1270 is OK: OK - interfaces up [13:34:01] RECOVERY - Disk space on mw1270 is OK: DISK OK [13:34:02] RECOVERY - dhclient process on mw1270 is OK: PROCS OK: 0 processes with command name dhclient [13:34:03] RECOVERY - nutcracker port on mw1270 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [13:34:25] (03PS2) 10Gehel: Don't publish etags for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/293492 (https://phabricator.wikimedia.org/T137238) [13:34:31] RECOVERY - Check size of conntrack table on mw1270 is OK: OK: nf_conntrack is 0 % full [13:34:32] RECOVERY - nutcracker process on mw1270 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [13:34:51] RECOVERY - MD RAID on mw1270 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [13:35:53] (03CR) 10Gehel: [C: 032] Don't publish etags for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/293492 (https://phabricator.wikimedia.org/T137238) (owner: 10Gehel) [13:37:32] RECOVERY - Apache HTTP on mw1271 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.016 second response time [13:38:13] PROBLEM - puppet last run on xenon is CRITICAL: CRITICAL: Puppet has 1 failures [13:39:32] RECOVERY - Apache HTTP on mw1269 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.019 second response time [13:41:42] RECOVERY - nutcracker port on mw1271 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [13:42:02] RECOVERY - nutcracker process on mw1269 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [13:42:02] RECOVERY - Disk space on mw1269 is OK: DISK OK [13:42:02] RECOVERY - nutcracker process on mw1271 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [13:42:21] RECOVERY - Disk space on mw1271 is OK: DISK OK [13:42:21] RECOVERY - MD RAID on mw1271 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [13:42:22] RECOVERY - nutcracker port on mw1269 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [13:42:22] RECOVERY - MD RAID on mw1269 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [13:42:41] RECOVERY - salt-minion processes on mw1269 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:42:41] RECOVERY - salt-minion processes on mw1271 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:42:51] RECOVERY - Check size of conntrack table on mw1271 is OK: OK: nf_conntrack is 0 % full [13:42:51] RECOVERY - Check size of conntrack table on mw1269 is OK: OK: nf_conntrack is 0 % full [13:42:52] there is also recovery spam [13:42:58] I am sorry folks :) [13:43:12] RECOVERY - configured eth on mw1271 is OK: OK - interfaces up [13:43:21] RECOVERY - configured eth on mw1269 is OK: OK - interfaces up [13:43:22] RECOVERY - dhclient process on mw1271 is OK: PROCS OK: 0 processes with command name dhclient [13:43:32] RECOVERY - dhclient process on mw1269 is OK: PROCS OK: 0 processes with command name dhclient [13:46:41] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:47:43] RECOVERY - DPKG on mw1271 is OK: All packages OK [13:47:43] RECOVERY - DPKG on mw1269 is 
OK: All packages OK [13:49:25] 06Operations, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2371668 (10JAllemandou) There still are warnings from webrequest sequence checks. **TL;DR** Luca's code solved the ordering issue, bu... [13:50:27] 06Operations, 10ops-eqiad: labsdb1001: Swap eth0 cable - https://phabricator.wikimedia.org/T137555#2371671 (10Andrew) [13:50:54] Testing patched Cassandra (dpkg -i ...; service cassandra-a restart) on cerium : T137474 [13:50:55] T137474: Investigate lack of recency bias in Cassandra histogram metrics - https://phabricator.wikimedia.org/T137474 [13:51:10] 06Operations: ms-be2012 ran out of disk space - https://phabricator.wikimedia.org/T137397#2371685 (10Dzahn) @jcrespo i restarted rsyslog after deleting the file, so i actually got free space from it after that [13:51:54] 06Operations, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2371687 (10JAllemandou) Proposed solution: If @elukey confirms that missing timestamps come from error in Varnish (timeout, connection... [13:51:59] 06Operations: ms-be2012 ran out of disk space - https://phabricator.wikimedia.org/T137397#2371688 (10jcrespo) Of course, it will be gone after restarting the process, but better if it not happens in the first place. [13:58:17] gah [13:58:27] !log Testing patched Cassandra (dpkg -i ...; service cassandra-a restart) on cerium : T137474 [13:58:28] T137474: Investigate lack of recency bias in Cassandra histogram metrics - https://phabricator.wikimedia.org/T137474 [13:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:59:54] !log Testing patched Cassandra (dpkg -i ...; service cassandra-a restart) on praseodymim : T137474 [13:59:56] T137474: Investigate lack of recency bias in Cassandra histogram metrics - https://phabricator.wikimedia.org/T137474 [13:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:25] (03CR) 10Anomie: [C: 031] "Looks sane." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/293701 (owner: 10Gergő Tisza) [14:05:31] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:06:51] !log Testing patched Cassandra (dpkg -i ...; service cassandra-a restart) on restbase-test2001 : T137474 [14:06:52] T137474: Investigate lack of recency bias in Cassandra histogram metrics - https://phabricator.wikimedia.org/T137474 [14:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:07:21] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.394 second response time [14:07:31] PROBLEM - puppet last run on cerium is CRITICAL: CRITICAL: Puppet has 1 failures [14:09:52] PROBLEM - puppet last run on xenon is CRITICAL: CRITICAL: Puppet has 1 failures [14:16:22] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:43] !log Testing patched Cassandra (dpkg -i ...; service cassandra-{a,b} restart) on restbase-test200[1-2] : T137474 [14:17:44] T137474: Investigate lack of recency bias in Cassandra histogram metrics - https://phabricator.wikimedia.org/T137474 [14:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:29:22] (03PS1) 10Faidon Liambotis: cassandra: pin 2.2.6+wmf1 on staging [puppet] - 10https://gerrit.wikimedia.org/r/293730 [14:29:37] 06Operations, 10Ops-Access-Requests, 06Services, 13Patch-For-Review: sc-admins should be able to join firejail containers - https://phabricator.wikimedia.org/T137412#2371810 (10JanZerebecki) a:05JanZerebecki>03None [14:32:19] (03PS2) 10Faidon Liambotis: cassandra: pin 2.2.6-wmf1 on staging [puppet] - 10https://gerrit.wikimedia.org/r/293730 [14:33:26] (03PS3) 10Faidon Liambotis: cassandra: pin 2.2.6-wmf1 on staging [puppet] - 10https://gerrit.wikimedia.org/r/293730 [14:34:36] (03CR) 10Eevans: [C: 031] "LGTM; Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/293730 (owner: 10Faidon Liambotis) [14:35:16] (03CR) 10Faidon Liambotis: [C: 032] cassandra: pin 2.2.6-wmf1 on staging [puppet] - 10https://gerrit.wikimedia.org/r/293730 (owner: 10Faidon Liambotis) [14:35:40] PROBLEM - Apache HTTP on mw2226 is CRITICAL: Connection timed out [14:35:40] PROBLEM - Apache HTTP on mw2224 is CRITICAL: Connection timed out [14:35:40] PROBLEM - Apache HTTP on mw2225 is CRITICAL: Connection timed out [14:35:48] PROBLEM - dhclient process on mw2227 is CRITICAL: Timeout while attempting connection [14:35:48] PROBLEM - dhclient process on mw2224 is CRITICAL: Timeout while attempting connection [14:35:49] PROBLEM - dhclient process on mw2226 is CRITICAL: Timeout while attempting connection [14:35:49] PROBLEM - dhclient process on mw2225 is CRITICAL: Timeout while attempting connection [14:35:59] PROBLEM - mediawiki-installation DSH group on mw2225 is CRITICAL: Host mw2225 is not in mediawiki-installation dsh group [14:36:00] PROBLEM - mediawiki-installation DSH group on mw2224 is CRITICAL: Host mw2224 is not in mediawiki-installation dsh group [14:36:00] PROBLEM - mediawiki-installation DSH group on mw2226 is CRITICAL: Host mw2226 is not in mediawiki-installation dsh group [14:36:00] PROBLEM - mediawiki-installation DSH group on mw2227 is CRITICAL: Host mw2227 is not in mediawiki-installation dsh group [14:36:38] PROBLEM - nutcracker port on mw2227 is CRITICAL: Timeout while attempting connection [14:36:38] PROBLEM - nutcracker port on mw2226 is CRITICAL: Timeout while attempting connection [14:36:38] PROBLEM - nutcracker port on mw2225 is CRITICAL: Timeout while attempting connection [14:36:38] PROBLEM - nutcracker port on mw2224 is CRITICAL: Timeout while attempting connection [14:36:58] PROBLEM - nutcracker process on mw2227 is CRITICAL: Timeout while attempting connection [14:36:58] PROBLEM - nutcracker process on mw2224 is CRITICAL: Timeout while attempting connection [14:36:58] PROBLEM - nutcracker process on mw2226 is CRITICAL: Timeout while attempting connection [14:36:58] PROBLEM - nutcracker process on mw2225 is CRITICAL: Timeout while attempting connection [14:37:08] PROBLEM - puppet last run on mw2227 is CRITICAL: Timeout while attempting connection [14:37:08] PROBLEM - puppet last run on mw2226 is CRITICAL: Timeout while attempting connection [14:37:08] PROBLEM - puppet last run on mw2224 is CRITICAL: Timeout while attempting connection [14:37:08] PROBLEM - puppet last run on mw2225 is CRITICAL: Timeout while attempting connection [14:37:19] PROBLEM - puppet last run on restbase-test2001 is CRITICAL: CRITICAL: Puppet has 1 failures [14:37:19] PROBLEM - puppet last run on praseodymium is CRITICAL: CRITICAL: Puppet has 1 failures [14:37:19] PROBLEM - salt-minion processes on mw2227 is CRITICAL: Timeout while attempting connection [14:37:19] PROBLEM - salt-minion processes on mw2225 is CRITICAL: Timeout while attempting connection [14:37:20] PROBLEM - salt-minion processes on mw2226 is CRITICAL: Timeout while attempting connection [14:37:20] PROBLEM - salt-minion processes on mw2224 is CRITICAL: Timeout while attempting connection [14:37:48] PROBLEM - Apache HTTP on mw2227 is CRITICAL: Connection timed out [14:38:08] PROBLEM - Check size of conntrack table on mw2227 is CRITICAL: Timeout while attempting connection [14:38:08] PROBLEM - Check size of conntrack table on mw2225 is CRITICAL: Timeout while attempting connection [14:38:08] PROBLEM - Check size of conntrack table on mw2226 is CRITICAL: Timeout while 
attempting connection [14:38:08] PROBLEM - Check size of conntrack table on mw2224 is CRITICAL: Timeout while attempting connection [14:38:08] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [14:38:19] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [14:38:28] PROBLEM - DPKG on mw2224 is CRITICAL: Timeout while attempting connection [14:38:28] PROBLEM - DPKG on mw2225 is CRITICAL: Timeout while attempting connection [14:38:28] PROBLEM - DPKG on mw2226 is CRITICAL: Timeout while attempting connection [14:38:28] PROBLEM - DPKG on mw2227 is CRITICAL: Timeout while attempting connection [14:38:40] PROBLEM - Disk space on mw2224 is CRITICAL: Timeout while attempting connection [14:38:40] PROBLEM - Disk space on mw2226 is CRITICAL: Timeout while attempting connection [14:38:40] PROBLEM - Disk space on mw2225 is CRITICAL: Timeout while attempting connection [14:38:40] PROBLEM - Disk space on mw2227 is CRITICAL: Timeout while attempting connection [14:38:59] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [14:38:59] PROBLEM - MD RAID on mw2227 is CRITICAL: Timeout while attempting connection [14:38:59] PROBLEM - MD RAID on mw2225 is CRITICAL: Timeout while attempting connection [14:38:59] PROBLEM - MD RAID on mw2226 is CRITICAL: Timeout while attempting connection [14:38:59] PROBLEM - MD RAID on mw2224 is CRITICAL: Timeout while attempting connection [14:39:29] RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [14:39:39] PROBLEM - puppet last run on restbase-test2003 is CRITICAL: CRITICAL: Puppet has 1 failures [14:39:49] PROBLEM - configured eth on mw2224 is CRITICAL: Timeout while attempting connection [14:39:49] PROBLEM - configured eth on mw2226 is CRITICAL: Timeout while attempting connection [14:39:49] PROBLEM - configured eth on mw2225 is CRITICAL: Timeout while attempting connection [14:39:49] PROBLEM - configured eth on mw2227 is CRITICAL: Timeout while attempting connection [14:44:40] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [14:45:59] (03PS1) 10Muehlenhoff: Define ferm service dynamicproxy-api-http in role::labs::novaproxy [puppet] - 10https://gerrit.wikimedia.org/r/293733 [14:48:45] (03CR) 10Andrew Bogott: [C: 031] Define ferm service dynamicproxy-api-http in role::labs::novaproxy [puppet] - 10https://gerrit.wikimedia.org/r/293733 (owner: 10Muehlenhoff) [14:49:38] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [14:50:29] RECOVERY - Apache HTTP on mw2224 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.080 second response time [14:50:45] (03PS1) 10Elukey: Add mw1269 and mw127[01] to the Mediawiki scap DSH list. 
[puppet] - 10https://gerrit.wikimedia.org/r/293735 [14:52:29] RECOVERY - configured eth on mw2224 is OK: OK - interfaces up [14:52:39] RECOVERY - Apache HTTP on mw2225 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.079 second response time [14:52:40] RECOVERY - dhclient process on mw2224 is OK: PROCS OK: 0 processes with command name dhclient [14:52:58] RECOVERY - Check size of conntrack table on mw2224 is OK: OK: nf_conntrack is 0 % full [14:53:29] RECOVERY - Disk space on mw2225 is OK: DISK OK [14:53:29] RECOVERY - Disk space on mw2224 is OK: DISK OK [14:53:38] RECOVERY - nutcracker port on mw2224 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [14:53:39] RECOVERY - nutcracker port on mw2225 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [14:53:50] RECOVERY - MD RAID on mw2224 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [14:53:50] RECOVERY - MD RAID on mw2225 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [14:53:51] RECOVERY - nutcracker process on mw2224 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [14:53:51] RECOVERY - nutcracker process on mw2225 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [14:54:18] RECOVERY - salt-minion processes on mw2224 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:54:18] RECOVERY - salt-minion processes on mw2225 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:54:38] RECOVERY - configured eth on mw2225 is OK: OK - interfaces up [14:54:48] RECOVERY - Apache HTTP on mw2226 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.076 second response time [14:54:49] RECOVERY - dhclient process on mw2225 is OK: PROCS OK: 0 processes with command name dhclient [14:54:59] (03CR) 10Elukey: [C: 032] Add mw1269 and mw127[01] to the Mediawiki scap DSH list. 
[puppet] - 10https://gerrit.wikimedia.org/r/293735 (owner: 10Elukey) [14:55:00] RECOVERY - Check size of conntrack table on mw2225 is OK: OK: nf_conntrack is 0 % full [14:55:04] (03PS1) 10Urbanecm: Add permission to selfremove from ipblock-exempt group in zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293737 (https://phabricator.wikimedia.org/T137532) [14:55:19] RECOVERY - DPKG on mw2225 is OK: All packages OK [14:55:19] RECOVERY - DPKG on mw2226 is OK: All packages OK [14:55:19] RECOVERY - DPKG on mw2224 is OK: All packages OK [14:55:39] RECOVERY - Disk space on mw2226 is OK: DISK OK [14:55:40] RECOVERY - nutcracker port on mw2226 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [14:55:59] RECOVERY - MD RAID on mw2226 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [14:56:00] RECOVERY - nutcracker process on mw2226 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [14:56:28] RECOVERY - salt-minion processes on mw2226 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:56:49] RECOVERY - configured eth on mw2226 is OK: OK - interfaces up [14:56:59] RECOVERY - dhclient process on mw2226 is OK: PROCS OK: 0 processes with command name dhclient [14:57:09] RECOVERY - Check size of conntrack table on mw2226 is OK: OK: nf_conntrack is 0 % full [14:58:19] (03PS2) 10Ema: varnish{xcache,xcps,...}: subscribe to varnishlog.py [puppet] - 10https://gerrit.wikimedia.org/r/293705 [14:58:40] (03CR) 10Ema: [C: 032 V: 032] varnish{xcache,xcps,...}: subscribe to varnishlog.py [puppet] - 10https://gerrit.wikimedia.org/r/293705 (owner: 10Ema) [14:59:19] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:01:28] if change-prop complains, that's me [15:01:29] RECOVERY - Apache HTTP on mw2227 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.076 second response time [15:01:40] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 2.327 second response time [15:02:19] andrewbogott: ^^ this has been going on for more than a day now [15:02:24] got anything to do with labs or sth? 
[15:02:28] RECOVERY - configured eth on mw2227 is OK: OK - interfaces up [15:02:36] RECOVERY - nutcracker port on mw2227 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [15:02:46] mobrovac: that's something yuvi has in progress, I'll bug him again when he gets here [15:02:47] RECOVERY - dhclient process on mw2227 is OK: PROCS OK: 0 processes with command name dhclient [15:02:56] RECOVERY - nutcracker process on mw2227 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [15:02:58] kk thnx andrewbogott [15:03:06] RECOVERY - puppet last run on restbase-test2001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [15:03:06] RECOVERY - Check size of conntrack table on mw2227 is OK: OK: nf_conntrack is 0 % full [15:03:16] RECOVERY - salt-minion processes on mw2227 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:03:29] (03PS1) 10Urbanecm: Enable transwiki import for la.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293738 (https://phabricator.wikimedia.org/T137547) [15:03:34] this server really needs a disk controller with a BBU and better/more disks [15:04:56] RECOVERY - Disk space on mw2227 is OK: DISK OK [15:05:16] RECOVERY - MD RAID on mw2227 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [15:05:29] (03PS1) 10JanZerebecki: (DO NOT SUBMIT) contint: pin firefox to 46 on Trusty [puppet] - 10https://gerrit.wikimedia.org/r/293739 (https://phabricator.wikimedia.org/T137561) [15:05:35] PROBLEM - Apache HTTP on mw2228 is CRITICAL: Connection timed out [15:05:35] PROBLEM - Apache HTTP on mw2229 is CRITICAL: Connection timed out [15:05:46] PROBLEM - mediawiki-installation DSH group on mw2229 is CRITICAL: Host mw2229 is not in mediawiki-installation dsh group [15:05:46] PROBLEM - mediawiki-installation DSH group on mw2228 is CRITICAL: Host mw2228 is not in mediawiki-installation dsh group [15:05:46] (03CR) 10JanZerebecki: [C: 04-1] (DO NOT SUBMIT) contint: pin firefox to 46 on Trusty [puppet] - 10https://gerrit.wikimedia.org/r/293739 (https://phabricator.wikimedia.org/T137561) (owner: 10JanZerebecki) [15:05:47] RECOVERY - puppet last run on restbase-test2003 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [15:06:06] PROBLEM - puppet last run on mw2224 is CRITICAL: CRITICAL: Puppet has 3 failures [15:06:17] PROBLEM - nutcracker port on mw2229 is CRITICAL: Timeout while attempting connection [15:06:17] PROBLEM - nutcracker port on mw2228 is CRITICAL: Timeout while attempting connection [15:06:45] PROBLEM - nutcracker process on mw2228 is CRITICAL: Timeout while attempting connection [15:06:46] PROBLEM - nutcracker process on mw2229 is CRITICAL: Timeout while attempting connection [15:06:46] RECOVERY - DPKG on mw2227 is OK: All packages OK [15:07:05] PROBLEM - puppet last run on mw2225 is CRITICAL: CRITICAL: Puppet has 3 failures [15:07:05] PROBLEM - puppet last run on mw2229 is CRITICAL: Timeout while attempting connection [15:07:05] PROBLEM - puppet last run on mw2228 is CRITICAL: Timeout while attempting connection [15:07:16] PROBLEM - salt-minion processes on mw2228 is CRITICAL: Timeout while attempting connection [15:07:16] PROBLEM - salt-minion processes on mw2229 is CRITICAL: Timeout while attempting connection [15:07:56] PROBLEM - Check size of conntrack table on mw2229 is CRITICAL: Timeout while attempting connection [15:07:56] PROBLEM - Check size of conntrack table on mw2228 is CRITICAL: Timeout while attempting connection 
[15:08:06] RECOVERY - mediawiki-installation DSH group on mw1269 is OK: OK [15:08:06] RECOVERY - mediawiki-installation DSH group on mw1270 is OK: OK [15:08:06] RECOVERY - mediawiki-installation DSH group on mw1271 is OK: OK [15:08:15] PROBLEM - DPKG on mw2229 is CRITICAL: Timeout while attempting connection [15:08:15] PROBLEM - DPKG on mw2228 is CRITICAL: Timeout while attempting connection [15:08:16] PROBLEM - Apache HTTP on mw2224 is CRITICAL: Connection refused [15:08:26] PROBLEM - Disk space on mw2229 is CRITICAL: Timeout while attempting connection [15:08:26] PROBLEM - Disk space on mw2228 is CRITICAL: Timeout while attempting connection [15:08:45] PROBLEM - MD RAID on mw2228 is CRITICAL: Timeout while attempting connection [15:08:46] PROBLEM - MD RAID on mw2229 is CRITICAL: Timeout while attempting connection [15:09:36] PROBLEM - configured eth on mw2228 is CRITICAL: Timeout while attempting connection [15:09:36] PROBLEM - configured eth on mw2229 is CRITICAL: Timeout while attempting connection [15:09:55] PROBLEM - dhclient process on mw2229 is CRITICAL: Timeout while attempting connection [15:09:55] PROBLEM - dhclient process on mw2228 is CRITICAL: Timeout while attempting connection [15:10:07] PROBLEM - puppet last run on mw2226 is CRITICAL: CRITICAL: Puppet has 3 failures [15:10:25] PROBLEM - Apache HTTP on mw2225 is CRITICAL: Connection refused [15:12:13] (03CR) 10Yurik: [C: 031] "seems ok, not testested" [puppet] - 10https://gerrit.wikimedia.org/r/293475 (https://phabricator.wikimedia.org/T112423) (owner: 10Gehel) [15:12:26] PROBLEM - Apache HTTP on mw2226 is CRITICAL: Connection refused [15:15:03] 06Operations, 10Traffic, 13Patch-For-Review: Scripts depending on varnishlog.py maxing out CPU usage on cache_misc - https://phabricator.wikimedia.org/T137114#2371928 (10ema) 05Open>03Resolved a:03ema Grouping transactions by requests solved the problem, confirmed on cp1051 and cp1061. Closing. 
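On the T137114 resolution just above: the varnishlog4 patches merged earlier switch the Python consumers to request grouping, so each client request arrives as one bundle of VSL records instead of a stream of separate transactions, which per the task is what solved the CPU problem on cache_misc. The stock Varnish 4 CLI illustrates the same idea (illustration only, not what the scripts literally run):

  # group VSL records per client request rather than per transaction
  varnishlog -g request -q 'ReqMethod eq "GET"'

The -g request behaviour is what https://gerrit.wikimedia.org/r/293530 makes the default for the wrapper.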
[15:15:41] (03PS2) 10Urbanecm: Permission changes in zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293737 (https://phabricator.wikimedia.org/T137532) [15:17:26] PROBLEM - puppet last run on mw2227 is CRITICAL: CRITICAL: Puppet has 3 failures [15:17:36] (03PS3) 10Urbanecm: Permission changes in zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293737 (https://phabricator.wikimedia.org/T137532) [15:18:55] PROBLEM - Apache HTTP on mw2227 is CRITICAL: Connection refused [15:21:47] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: /page/mobile-sections/{title} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexpected status 500 (expecting: 200) [15:22:16] RECOVERY - Apache HTTP on mw2228 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.077 second response time [15:23:26] RECOVERY - MD RAID on mw2228 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [15:23:27] RECOVERY - nutcracker process on mw2228 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [15:23:56] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [15:24:05] RECOVERY - salt-minion processes on mw2228 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:24:15] RECOVERY - configured eth on mw2228 is OK: OK - interfaces up [15:24:25] RECOVERY - Apache HTTP on mw2229 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.074 second response time [15:24:26] RECOVERY - dhclient process on mw2228 is OK: PROCS OK: 0 processes with command name dhclient [15:24:37] RECOVERY - Check size of conntrack table on mw2228 is OK: OK: nf_conntrack is 0 % full [15:25:06] RECOVERY - nutcracker port on mw2228 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [15:25:15] RECOVERY - Disk space on mw2228 is OK: DISK OK [15:26:17] RECOVERY - configured eth on mw2229 is OK: OK - interfaces up [15:26:36] RECOVERY - dhclient process on mw2229 is OK: PROCS OK: 0 processes with command name dhclient [15:26:46] RECOVERY - Check size of conntrack table on mw2229 is OK: OK: nf_conntrack is 0 % full [15:27:06] RECOVERY - DPKG on mw2228 is OK: All packages OK [15:27:16] RECOVERY - nutcracker port on mw2229 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [15:27:25] RECOVERY - Disk space on mw2229 is OK: DISK OK [15:27:37] RECOVERY - MD RAID on mw2229 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [15:27:46] RECOVERY - nutcracker process on mw2229 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [15:27:47] 06Operations, 10ops-codfw: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2371939 (10Papaul) [15:28:16] RECOVERY - salt-minion processes on mw2229 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:29:17] RECOVERY - DPKG on mw2229 is OK: All packages OK [15:30:02] (03Abandoned) 10JanZerebecki: (DO NOT SUBMIT) contint: pin firefox to 46 on Trusty [puppet] - 10https://gerrit.wikimedia.org/r/293739 (https://phabricator.wikimedia.org/T137561) (owner: 10JanZerebecki) [15:33:15] kart_: no we do not have separate repo for Beta [15:33:50] (03CR) 10Gehel: "Puppet compiler: https://puppet-compiler.wmflabs.org/3084/ looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/293475 (https://phabricator.wikimedia.org/T112423) (owner: 10Gehel) [15:34:10] (03PS4) 10Gehel: explicitely set input reader format in osm2pgsql [puppet] - 10https://gerrit.wikimedia.org/r/293475 (https://phabricator.wikimedia.org/T112423) [15:34:14] PROBLEM - Apache HTTP on mw2230 is CRITICAL: Connection timed out [15:34:14] PROBLEM - Apache HTTP on mw2231 is CRITICAL: Connection timed out [15:34:54] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:35:17] (03PS8) 10Gehel: Upgrade osm2pgsql to 0.90.0 [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) [15:35:34] PROBLEM - puppet last run on mw2232 is CRITICAL: Timeout while attempting connection [15:35:34] PROBLEM - puppet last run on mw2230 is CRITICAL: Timeout while attempting connection [15:35:34] PROBLEM - puppet last run on mw2231 is CRITICAL: Timeout while attempting connection [15:35:35] (03PS1) 10Jcrespo: Install parallel gzip (pigz) and parallel xz (pxz) on all servers [puppet] - 10https://gerrit.wikimedia.org/r/293743 [15:35:51] (03CR) 10Gehel: [C: 032] explicitely set input reader format in osm2pgsql [puppet] - 10https://gerrit.wikimedia.org/r/293475 (https://phabricator.wikimedia.org/T112423) (owner: 10Gehel) [15:35:55] PROBLEM - salt-minion processes on mw2232 is CRITICAL: Timeout while attempting connection [15:35:55] PROBLEM - salt-minion processes on mw2230 is CRITICAL: Timeout while attempting connection [15:35:55] PROBLEM - salt-minion processes on mw2231 is CRITICAL: Timeout while attempting connection [15:36:15] PROBLEM - Apache HTTP on mw2232 is CRITICAL: Connection timed out [15:36:42] (03PS9) 10Gehel: Upgrade osm2pgsql to 0.90.0 [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) [15:36:44] PROBLEM - Check size of conntrack table on mw2230 is CRITICAL: Timeout while attempting connection [15:36:44] PROBLEM - Check size of conntrack table on mw2231 is CRITICAL: Timeout while attempting connection [15:36:44] PROBLEM - Check size of conntrack table on mw2232 is CRITICAL: Timeout while attempting connection [15:36:54] PROBLEM - DPKG on mw2230 is CRITICAL: Timeout while attempting connection [15:36:54] PROBLEM - DPKG on mw2232 is CRITICAL: Timeout while attempting connection [15:36:54] PROBLEM - DPKG on mw2231 is CRITICAL: Timeout while attempting connection [15:37:04] PROBLEM - Disk space on mw2230 is CRITICAL: Timeout while attempting connection [15:37:04] PROBLEM - Disk space on mw2231 is CRITICAL: Timeout while attempting connection [15:37:04] PROBLEM - Disk space on mw2232 is CRITICAL: Timeout while attempting connection [15:37:34] PROBLEM - MD RAID on mw2231 is CRITICAL: Timeout while attempting connection [15:37:34] PROBLEM - MD RAID on mw2230 is CRITICAL: Timeout while attempting connection [15:37:34] PROBLEM - MD RAID on mw2232 is CRITICAL: Timeout while attempting connection [15:38:26] PROBLEM - configured eth on mw2230 is CRITICAL: Timeout while attempting connection [15:38:26] PROBLEM - configured eth on mw2231 is CRITICAL: Timeout while attempting connection [15:38:26] PROBLEM - configured eth on mw2232 is CRITICAL: Timeout while attempting connection [15:38:44] PROBLEM - dhclient process on mw2231 is CRITICAL: Timeout while attempting connection [15:38:45] 
PROBLEM - dhclient process on mw2232 is CRITICAL: Timeout while attempting connection [15:38:45] PROBLEM - dhclient process on mw2230 is CRITICAL: Timeout while attempting connection [15:38:54] PROBLEM - mediawiki-installation DSH group on mw2231 is CRITICAL: Host mw2231 is not in mediawiki-installation dsh group [15:38:54] PROBLEM - mediawiki-installation DSH group on mw2230 is CRITICAL: Host mw2230 is not in mediawiki-installation dsh group [15:38:54] PROBLEM - mediawiki-installation DSH group on mw2232 is CRITICAL: Host mw2232 is not in mediawiki-installation dsh group [15:39:04] PROBLEM - puppet last run on mw2228 is CRITICAL: CRITICAL: Puppet has 3 failures [15:39:05] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [15:39:15] PROBLEM - nutcracker port on mw2231 is CRITICAL: Timeout while attempting connection [15:39:16] PROBLEM - nutcracker port on mw2232 is CRITICAL: Timeout while attempting connection [15:39:16] PROBLEM - nutcracker port on mw2230 is CRITICAL: Timeout while attempting connection [15:39:34] PROBLEM - nutcracker process on mw2232 is CRITICAL: Timeout while attempting connection [15:39:34] PROBLEM - nutcracker process on mw2230 is CRITICAL: Timeout while attempting connection [15:39:34] PROBLEM - nutcracker process on mw2231 is CRITICAL: Timeout while attempting connection [15:39:35] PROBLEM - puppet last run on mw2229 is CRITICAL: CRITICAL: Puppet has 3 failures [15:39:36] PROBLEM - Apache HTTP on mw2228 is CRITICAL: Connection refused [15:39:38] (03CR) 10Gehel: [C: 032] Upgrade osm2pgsql to 0.90.0 [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) (owner: 10Gehel) [15:40:11] 06Operations, 10ops-codfw: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2371963 (10Papaul) OS install, puppet certs, salt-key, initial run done on mw2215-mw2232 [15:41:14] anybody that has free time to help me clearing all the codfw mw alerts? [15:41:36] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 07LDAP: add Ladsgroup to nda LDAP group (was: Grant graphite.wikimedia.org rights to grafana-admin LDAP group) - https://phabricator.wikimedia.org/T137373#2371972 (10Ladsgroup) > What is the difference between graphite metrics selection as seen o... [15:41:45] PROBLEM - Apache HTTP on mw2229 is CRITICAL: Connection refused [15:41:47] basically following https://etherpad.wikimedia.org/p/jessie-install [15:42:11] we have tons of new codfw mw servers but also a lot of criticals :D [15:42:55] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.248 second response time [15:43:28] elukey: what do you mean by clearing the alerts? Investigation? Or for loop to ack them? [15:43:45] nope completing the installs.. [15:44:29] elukey: I can try to help, but it might take you more time to explain what need doing than to do it... [15:45:14] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 9.667 second response time [15:45:26] gehel: let's try, it should be super hard.. 
so https://etherpad.wikimedia.org/p/jessie-install [15:45:44] if you see there are a lot of codfw servers that papaul just finished to reimage [15:46:26] elukey: yep, I see them [15:46:29] most of them are failing due to /bin/systemctl start mw-cgroup but with a reboot they should be fine (== puppet will run fine) [15:46:45] RECOVERY - Apache HTTP on mw2230 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.078 second response time [15:46:50] after that, it is only a matter of adding them to DSH + scap pull + apache restart [15:46:57] but [15:47:08] even having them with puppet running fine would be awesome [15:48:40] I am using something like @neon:~$ sudo icinga-downtime -h mw2215 -d 10800 -r "Bootstrapping --elukey" to schedule downtime [15:48:55] RECOVERY - configured eth on mw2230 is OK: OK - interfaces up [15:49:15] RECOVERY - dhclient process on mw2230 is OK: PROCS OK: 0 processes with command name dhclient [15:49:15] RECOVERY - Check size of conntrack table on mw2230 is OK: OK: nf_conntrack is 0 % full [15:49:35] RECOVERY - Disk space on mw2230 is OK: DISK OK [15:49:45] RECOVERY - nutcracker port on mw2230 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [15:50:04] RECOVERY - nutcracker process on mw2230 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [15:50:06] RECOVERY - MD RAID on mw2230 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [15:50:35] RECOVERY - salt-minion processes on mw2230 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:51:35] RECOVERY - DPKG on mw2230 is OK: All packages OK [15:57:25] RECOVERY - Apache HTTP on mw2216 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.212 second response time [16:00:01] (03CR) 10JanZerebecki: [C: 031] Install parallel gzip (pigz) and parallel xz (pxz) on all servers [puppet] - 10https://gerrit.wikimedia.org/r/293743 (owner: 10Jcrespo) [16:01:55] RECOVERY - Apache HTTP on mw2232 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.079 second response time [16:02:46] RECOVERY - Disk space on mw2232 is OK: DISK OK [16:02:56] RECOVERY - nutcracker port on mw2232 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [16:03:14] RECOVERY - nutcracker process on mw2232 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [16:03:24] RECOVERY - MD RAID on mw2232 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [16:03:24] PROBLEM - puppet last run on mw2230 is CRITICAL: CRITICAL: Puppet has 3 failures [16:03:45] RECOVERY - salt-minion processes on mw2232 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:04:15] RECOVERY - configured eth on mw2232 is OK: OK - interfaces up [16:04:34] RECOVERY - dhclient process on mw2232 is OK: PROCS OK: 0 processes with command name dhclient [16:04:35] RECOVERY - Check size of conntrack table on mw2232 is OK: OK: nf_conntrack is 0 % full [16:05:23] mmmmm [16:05:24] Jun 10 15:57:47 mw2216 mkdir[3707]: /bin/mkdir: cannot create directory ‘/sys/fs/cgroup/memory’: Read-only file system [16:06:15] PROBLEM - Apache HTTP on mw2230 is CRITICAL: Connection refused [16:06:15] RECOVERY - Apache HTTP on mw2231 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.075 second response time [16:06:55] RECOVERY - DPKG on mw2232 is OK: All packages OK [16:07:44] RECOVERY - MD RAID on mw2231 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [16:07:49] (03PS1) 10Dzahn: switch git.wikimedia.org from misc to text cluster [dns] - 
10https://gerrit.wikimedia.org/r/293747 (https://phabricator.wikimedia.org/T123718) [16:08:05] RECOVERY - salt-minion processes on mw2231 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:08:35] RECOVERY - configured eth on mw2231 is OK: OK - interfaces up [16:08:45] RECOVERY - dhclient process on mw2231 is OK: PROCS OK: 0 processes with command name dhclient [16:08:55] RECOVERY - Check size of conntrack table on mw2231 is OK: OK: nf_conntrack is 0 % full [16:09:15] RECOVERY - Disk space on mw2231 is OK: DISK OK [16:09:24] jynus, if you could take a look at https://phabricator.wikimedia.org/T137567 and let me know if it needs any more information, I'd be happy to provide it. [16:09:25] RECOVERY - nutcracker port on mw2231 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [16:09:35] RECOVERY - nutcracker process on mw2231 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [16:11:15] RECOVERY - DPKG on mw2231 is OK: All packages OK [16:11:32] halfak, see my comment [16:11:56] usually it only requires a puppet commit + a service restart, so it should not take much time [16:12:03] 06Operations, 10DBA, 10Wikidata, 07Performance: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow - https://phabricator.wikimedia.org/T116404#2372089 (10hoo) 05Open>03declined We'll do {T137539} instead. [16:12:27] jynus, great news. Thanks! [16:14:25] PROBLEM - NTP on mw2232 is CRITICAL: NTP CRITICAL: Offset unknown [16:14:25] PROBLEM - NTP on mw2231 is CRITICAL: NTP CRITICAL: Offset unknown [16:15:15] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [16:16:25] PROBLEM - puppet last run on mw2232 is CRITICAL: CRITICAL: Puppet has 3 failures [16:18:44] RECOVERY - NTP on mw2232 is OK: NTP OK: Offset 0.001352190971 secs [16:19:15] PROBLEM - Apache HTTP on mw2232 is CRITICAL: Connection refused [16:22:43] (03PS3) 10Dzahn: rm cp1043,cp1044 from site, installserver & torrus [puppet] - 10https://gerrit.wikimedia.org/r/293669 (https://phabricator.wikimedia.org/T133614) [16:23:23] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 07LDAP: add Ladsgroup to nda LDAP group (was: Grant graphite.wikimedia.org rights to grafana-admin LDAP group) - https://phabricator.wikimedia.org/T137373#2366461 (10Halfak) I also use graphite in this way. Clunky, indeed, but graphite is an *ex... 
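For reference, the per-host routine elukey describes above (downtime the alerts, reboot so mw-cgroup can start and puppet converges, then scap pull, restart Apache, and add the host to the mediawiki-installation dsh group) comes down to roughly the following. This is only a sketch: the hostname, downtime length and exact command invocations are assumptions, and the canonical steps are the ones on the jessie-install etherpad.

```bash
host=mw2230

# Silence the icinga alerts while the host is being bootstrapped (run on neon,
# using the same helper elukey quotes above).
sudo icinga-downtime -h "$host" -d 10800 -r "Bootstrapping --elukey"

# Reboot so the mw-cgroup / read-only /sys/fs/cgroup/memory failure goes away
# (elukey notes a reboot normally clears it), then let puppet converge.
ssh "$host" 'sudo reboot'
# ...wait for the host to come back, then:
ssh "$host" 'sudo puppet agent --test'

# Pull the current MediaWiki code onto the host and restart Apache.
ssh "$host" 'sudo scap pull && sudo service apache2 restart'

# Last step is adding the host to the mediawiki-installation dsh group (a
# puppet change), which also clears the "not in mediawiki-installation dsh
# group" icinga checks.
```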
[16:24:08] (03PS4) 10Dzahn: rm cp1043,cp1044 from site, installserver & torrus [puppet] - 10https://gerrit.wikimedia.org/r/293669 (https://phabricator.wikimedia.org/T133614) [16:25:08] (03CR) 10Dzahn: [C: 032] rm cp1043,cp1044 from site, installserver & torrus [puppet] - 10https://gerrit.wikimedia.org/r/293669 (https://phabricator.wikimedia.org/T133614) (owner: 10Dzahn) [16:25:14] RECOVERY - NTP on mw2231 is OK: NTP OK: Offset -0.00322675705 secs [16:26:05] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [16:27:38] !log cp1043/cp1044 - decom'ing, were already "Unused spare system" but running, scheduling downtime in icinga, shutting them down and removing from torrus config and puppet (T133614) [16:27:38] T133614: reclaim or decom: cp1043 + cp1044 - https://phabricator.wikimedia.org/T133614 [16:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:32:25] !log cp1043,cp1044 shutdown -h, confirmed not in pybal/confctl [16:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:33:15] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [16:37:35] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [16:41:46] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:42:40] (03CR) 1020after4: [C: 031] switch git.wikimedia.org from misc to text cluster [dns] - 10https://gerrit.wikimedia.org/r/293747 (https://phabricator.wikimedia.org/T123718) (owner: 10Dzahn) [16:43:54] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.021 second response time [16:45:50] (03CR) 10Paladox: [C: 031] switch git.wikimedia.org from misc to text cluster [dns] - 10https://gerrit.wikimedia.org/r/293747 (https://phabricator.wikimedia.org/T123718) (owner: 10Dzahn) [16:56:06] PROBLEM - Apache HTTP on mw2231 is CRITICAL: Connection refused [16:56:23] (03PS1) 10Elukey: Enable base::grub::enable_memory_cgroup on the new mw codfw servers. [puppet] - 10https://gerrit.wikimedia.org/r/293752 [16:57:29] (03CR) 10Muehlenhoff: [C: 04-1] "I like that, but pxz is not available in precise, so this needs an os_version conditional" [puppet] - 10https://gerrit.wikimedia.org/r/293743 (owner: 10Jcrespo) [16:58:25] (03CR) 10Dzahn: [C: 031] "matches the info from papaul on the etherpad" [puppet] - 10https://gerrit.wikimedia.org/r/293752 (owner: 10Elukey) [16:59:01] (03CR) 10Hashar: "Has anyone considered to keep it on misc-lb but change the backend to iridium (Phabricator host) and the redirects handled there? 
This w" [dns] - 10https://gerrit.wikimedia.org/r/293747 (https://phabricator.wikimedia.org/T123718) (owner: 10Dzahn) [17:06:42] (03CR) 10Jcrespo: "Good catch- this can wait until precise no longer exists :-)" [puppet] - 10https://gerrit.wikimedia.org/r/293743 (owner: 10Jcrespo) [17:12:40] !log Updated the puppet compiler with new hosts/facts [17:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:18:18] (03PS5) 10Ori.livneh: Make mwrepl a little more user friendly [puppet] - 10https://gerrit.wikimedia.org/r/283604 (owner: 10EBernhardson) [17:18:27] (03CR) 10Ori.livneh: [C: 032 V: 032] Make mwrepl a little more user friendly [puppet] - 10https://gerrit.wikimedia.org/r/283604 (owner: 10EBernhardson) [17:32:24] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:34:25] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.115 second response time [17:42:00] (03PS1) 10Ori.livneh: MWMultiVersion: allow wiki to be specified via the environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293760 [17:46:26] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2361418 (10Danny_B) First shot just off-head, untested. HTH. ``` # replace all %2f in repo name with / ^/(commit|log... [17:48:07] well shit [17:50:54] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [18:02:37] Damn it. Where is the "thank" button in Phab [18:03:42] halfak: "Award Token" and memes [18:04:01] Oh! [18:04:26] * halfak awards a token [18:04:27] :) [18:08:49] :) we even have the custom barnstar now [18:12:20] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2372568 (10Dzahn) Thank you DannyB! Btw, we have setup a labs instance for testing http://git.wmflabs.org/ and hap... [18:16:25] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [18:21:25] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:25] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 8.194 second response time [18:25:12] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2372576 (10demon) Also needs to handle: ``` git.wikimedia.org/ -> phabricator.wikimedia.org/diffusion/query/active/... [18:26:16] mutante: Added a few more, but I think that covers all of them [18:28:24] PROBLEM - Host cp1043 is DOWN: PING CRITICAL - Packet loss = 100% [18:29:14] PROBLEM - Host cp1044 is DOWN: PING CRITICAL - Packet loss = 100% [18:30:13] (03CR) 10Kaldari: [C: 032] "Issues from security review have been resolved. Deploying to Beta Labs for testing." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/287936 (https://phabricator.wikimedia.org/T134770) (owner: 10Addshore) [18:30:32] (03CR) 10jenkins-bot: [V: 04-1] Load the RevisionSlider extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287936 (https://phabricator.wikimedia.org/T134770) (owner: 10Addshore) [18:30:57] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2372602 (10Paladox) [18:31:15] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2361418 (10Paladox) [18:32:42] ostriches: awesome:) [18:34:33] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2372604 (10Paladox) [18:37:38] (03PS8) 1020after4: Avoid breaking full phabricator URLs [puppet] - 10https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) (owner: 10Thiemo Mättig (WMDE)) [18:44:07] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2218 is CRITICAL: Host mw2218 is not in mediawiki-installation dsh group daniel_zahn freshly installed servers [18:44:07] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2221 is CRITICAL: Host mw2221 is not in mediawiki-installation dsh group daniel_zahn freshly installed servers [18:44:07] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2222 is CRITICAL: Host mw2222 is not in mediawiki-installation dsh group daniel_zahn freshly installed servers [18:44:07] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2223 is CRITICAL: Host mw2223 is not in mediawiki-installation dsh group daniel_zahn freshly installed servers [18:44:07] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2224 is CRITICAL: Host mw2224 is not in mediawiki-installation dsh group daniel_zahn freshly installed servers [18:44:07] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2225 is CRITICAL: Host mw2225 is not in mediawiki-installation dsh group daniel_zahn freshly installed servers [18:44:08] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2226 is CRITICAL: Host mw2226 is not in mediawiki-installation dsh group daniel_zahn freshly installed servers [18:44:08] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2227 is CRITICAL: Host mw2227 is not in mediawiki-installation dsh group daniel_zahn freshly installed servers [18:44:09] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2228 is CRITICAL: Host mw2228 is not in mediawiki-installation dsh group daniel_zahn freshly installed servers [18:44:10] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2229 is CRITICAL: Host mw2229 is not in mediawiki-installation dsh group daniel_zahn freshly installed servers [18:44:10] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2230 is CRITICAL: Host mw2230 is not in mediawiki-installation dsh group daniel_zahn freshly installed servers [18:44:10] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2231 is CRITICAL: Host mw2231 is not in mediawiki-installation dsh group daniel_zahn freshly installed servers [18:47:29] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2217 is CRITICAL: Host mw2217 is not in mediawiki-installation dsh group daniel_zahn fresh installs [18:47:29] 
ACKNOWLEDGEMENT - puppet last run on mw2217 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn fresh installs [18:47:29] ACKNOWLEDGEMENT - puppet last run on mw2218 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn fresh installs [18:47:29] ACKNOWLEDGEMENT - Apache HTTP on mw2220 is CRITICAL: Connection refused daniel_zahn fresh installs [18:47:29] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2220 is CRITICAL: Host mw2220 is not in mediawiki-installation dsh group daniel_zahn fresh installs [18:47:38] sorry, tried to disable them before [18:47:51] but it overrides a downtime [18:48:15] just so that they are not showing up like unhandled new issues [18:48:57] back to just 4 unhandled CRITs [18:50:44] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 3 failures [18:56:04] (03PS1) 10Urbanecm: Temporary IP Cap Lift for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293788 (https://phabricator.wikimedia.org/T137587) [18:57:40] Hi, can anybody deploy this? https://gerrit.wikimedia.org/r/#/c/293788/ The event started a few minutes ago. Thanks. [18:57:47] It's for T137587. [18:57:47] T137587: Temporary IP Cap Lift - https://phabricator.wikimedia.org/T137587 [19:02:42] Pings: anomie ostriches thcipriani marktraceur [19:02:53] Pong [19:03:05] Oh, sorry, I'm not available for deploys [19:03:55] Ok. I saw you in the SWAT member list so I thought that you could do it when you're online. [19:04:11] SWAT is a particular defined time period, and doesn't occur on Friday. [19:05:29] I know, but I thought that anyone from SWAT can do any deploy at any time :) . If it's needful... [19:05:57] We can do deploys, but we check with the good ship greg-g (or whoever is greg-g today) before doing anything unscheduled. [19:06:39] "whoever is greg-g today" :) [19:06:55] And can someone do the checking soon? [19:07:37] ostriches: is it fine with you if we deploy a throttle exception that was requested on short notice? [19:09:33] Thanks for asking thcipriani [19:11:06] (03PS1) 10Dzahn: varnish: git.wm.org to antimony, remove git-related config/tests [puppet] - 10https://gerrit.wikimedia.org/r/293789 (https://phabricator.wikimedia.org/T137224) [19:11:15] thcipriani: Looks fine [19:11:27] ostriches: thanks. [19:11:35] * thcipriani does the needful [19:12:20] (03CR) 10Thcipriani: [C: 032] Temporary IP Cap Lift for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293788 (https://phabricator.wikimedia.org/T137587) (owner: 10Urbanecm) [19:12:52] (03Merged) 10jenkins-bot: Temporary IP Cap Lift for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293788 (https://phabricator.wikimedia.org/T137587) (owner: 10Urbanecm) [19:13:37] (03PS2) 10Ori.livneh: MWMultiVersion: allow wiki to be specified via the environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293760 [19:13:37] ah, crap. [19:14:05] Urbanecm: "wmgThrottlingExeptions" missing a 'c' in Exceptions [19:14:21] (03PS1) 10Cmjohnson: Adding netboot cfg and dhcpd entries for maps1001-4 [puppet] - 10https://gerrit.wikimedia.org/r/293790 [19:14:53] Oh... I'll fix it. [19:14:58] thank you. [19:15:51] In a new patch, or in this one via git review? I'm not sure if I can do it once a patch has been merged. [19:15:58] in a new patch [19:16:01] Ok. [19:16:52] thcipriani: do you mind if I merge a change in the meantime?
I won't deploy throttle.php [19:17:11] ori: be my guest [19:17:14] thanks [19:17:27] (03CR) 10Ori.livneh: [C: 032] MWMultiVersion: allow wiki to be specified via the environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293760 (owner: 10Ori.livneh) [19:17:40] (03PS2) 10Cmjohnson: Adding netboot cfg and dhcpd entries for maps1001-4 [puppet] - 10https://gerrit.wikimedia.org/r/293790 [19:17:41] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:18:02] (03Merged) 10jenkins-bot: MWMultiVersion: allow wiki to be specified via the environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293760 (owner: 10Ori.livneh) [19:18:20] (03PS1) 10Urbanecm: Fix for ip lift cap for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293791 (https://phabricator.wikimedia.org/T137587) [19:18:51] Sent as https://gerrit.wikimedia.org/r/#/c/293791/ . [19:18:55] (03CR) 10Dzahn: "hashar, no it's still an option too. you might be right. what do others think? if we want that then we need https://gerrit.wikimedia.org/" [dns] - 10https://gerrit.wikimedia.org/r/293747 (https://phabricator.wikimedia.org/T123718) (owner: 10Dzahn) [19:18:59] thcipriani, you allowed registrations in May :P [19:19:41] Urbanecm: ^ one other problem with the patch. [19:19:45] !log ori@tin Synchronized multiversion: Id432e25c: MWMultiVersion: allow wiki to be specified via the environment (duration: 00m 56s) [19:19:48] MaxSem: thanks for that :) [19:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:19:57] What? [19:20:25] 2016-05-10 should be 2016-06-10 [19:20:37] (03PS1) 10Papaul: Decommission:Remove mgmt DNS entries for es2005-es2010 from Bug:T134755 [dns] - 10https://gerrit.wikimedia.org/r/293792 (https://phabricator.wikimedia.org/T134755) [19:21:27] my whole point is that such requests 10 minutes before the start result in a rush and possible breakages. this one didn't break the wikis, but let's not try too hard :) [19:22:04] (03CR) 10Cmjohnson: [C: 032] Adding netboot cfg and dhcpd entries for maps1001-4 [puppet] - 10https://gerrit.wikimedia.org/r/293790 (owner: 10Cmjohnson) [19:22:08] MaxSem: definitely a fair point. Thank you for the sharp eye. [19:23:00] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests, 13Patch-For-Review: Decommission es2005-es2010 - https://phabricator.wikimedia.org/T134755#2372708 (10Papaul) [19:23:04] (03PS2) 10Urbanecm: Fix for ip lift cap for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293791 (https://phabricator.wikimedia.org/T137587) [19:23:07] sharp eye and sharp tongue [19:23:14] sharp all around! :P [19:23:33] Fixed. 
[19:23:52] In PS2 in https://gerrit.wikimedia.org/r/#/c/293791/ :) [19:24:28] (03CR) 10Thcipriani: [C: 032] Fix for ip lift cap for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293791 (https://phabricator.wikimedia.org/T137587) (owner: 10Urbanecm) [19:25:17] (03PS3) 10Thcipriani: Fix for ip lift cap for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293791 (https://phabricator.wikimedia.org/T137587) (owner: 10Urbanecm) [19:26:18] (03CR) 10Thcipriani: Fix for ip lift cap for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293791 (https://phabricator.wikimedia.org/T137587) (owner: 10Urbanecm) [19:26:40] (03CR) 10Thcipriani: [C: 032] Fix for ip lift cap for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293791 (https://phabricator.wikimedia.org/T137587) (owner: 10Urbanecm) [19:27:19] (03CR) 10Chad: "This should be split up, the test cleanup/removal and removal of the X-Forwarded-Port should happen regardless of whether it goes to iridi" [puppet] - 10https://gerrit.wikimedia.org/r/293789 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [19:27:23] (03Merged) 10jenkins-bot: Fix for ip lift cap for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293791 (https://phabricator.wikimedia.org/T137587) (owner: 10Urbanecm) [19:28:58] (03PS1) 10Ori.livneh: mwrepl improvements [puppet] - 10https://gerrit.wikimedia.org/r/293795 [19:30:11] 06Operations, 10ops-eqiad: Rack/Setup 4 map servers in eqiad - https://phabricator.wikimedia.org/T135018#2372716 (10Cmjohnson) [19:30:13] 06Operations, 10Wikimedia-IRC-RC-Server, 13Patch-For-Review: udpmxircecho spam/not working if unable to connect to irc server - https://phabricator.wikimedia.org/T134875#2372717 (10Dzahn) p:05High>03Normal also lowering priority because it is much less severe now. before: if a reboot happens does not co... [19:30:30] !log thcipriani@tin Synchronized wmf-config/throttle.php: [[gerrit:293791|Fix for ip lift cap for eswiki]] and [[gerrit:293788|Temporary IP Cap Lift for eswiki]] (duration: 00m 23s) [19:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:30:41] ^ Urbanecm sync'd! [19:31:16] Thanks :) . I don't like tasks like this :). As we can see, we need at least an hour, preferably a week :D [19:31:39] ^ definitely :) [19:31:41] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.003 second response time [19:33:42] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 1.393 second response time [19:33:53] (03CR) 10EBernhardson: [C: 031] "looks to do the trick without maintaining a separate startup macro. good stuff!" [puppet] - 10https://gerrit.wikimedia.org/r/293795 (owner: 10Ori.livneh) [19:34:17] (03PS2) 10Ori.livneh: mwrepl improvements [puppet] - 10https://gerrit.wikimedia.org/r/293795 [19:34:23] (03CR) 10Ori.livneh: [C: 032 V: 032] mwrepl improvements [puppet] - 10https://gerrit.wikimedia.org/r/293795 (owner: 10Ori.livneh) [19:34:57] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests, 13Patch-For-Review: Decommission es2005-es2010 - https://phabricator.wikimedia.org/T134755#2372731 (10Papaul) [19:36:16] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests, 13Patch-For-Review: Decommission es2005-es2010 - https://phabricator.wikimedia.org/T134755#2275897 (10Papaul) a:05Papaul>03RobH @Robh all steps complete, you're good to remove the entries on the switches.
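For anyone wondering what the throttle deploy just logged actually involves on the deployment host, it is roughly the sequence below. This is a sketch from memory rather than the documented procedure: the staging path, the sanity checks and the sync-file invocation are assumptions.

```bash
# On the deployment host (tin), inside the mediawiki-config staging copy.
cd /srv/mediawiki-staging

# Bring in the change that jenkins-bot just merged.
git pull

# Cheap sanity checks before syncing: the file still parses, and the config
# key is spelled the way mediawiki-config expects (the missing 'c' in
# wmgThrottlingExeptions above is exactly the kind of typo this catches).
php -l wmf-config/throttle.php
grep -c 'wmgThrottlingExceptions' wmf-config/throttle.php

# Push the single file out to the cluster with a log message; this is what
# shows up in the SAL as "Synchronized wmf-config/throttle.php: ...".
sync-file wmf-config/throttle.php 'Temporary IP cap lift for eswiki (T137587)'
```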
[19:36:52] more servers bite the dust [19:38:08] oh, that reminds me [19:38:18] !log cp1043/cp1044 - revoke puppet cert, salt key [19:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:41:08] 06Operations, 10Mail: not being able to send emails via Special:EmailUser - https://phabricator.wikimedia.org/T137337#2365463 (10Aklapper) [19:41:53] (03PS1) 10Dzahn: remove cp1043,cp1044 IPs [dns] - 10https://gerrit.wikimedia.org/r/293798 (https://phabricator.wikimedia.org/T133614) [19:43:49] (03CR) 10Dzahn: [C: 032] "already shut down" [dns] - 10https://gerrit.wikimedia.org/r/293798 (https://phabricator.wikimedia.org/T133614) (owner: 10Dzahn) [19:44:41] 06Operations, 10ops-eqiad, 10hardware-requests, 13Patch-For-Review: reclaim or decom: cp1043 + cp1044 - https://phabricator.wikimedia.org/T133614#2237371 (10Dzahn) deleted puppet certs, salt-keys, icinga stored config, removed from DNS still needs: mgmt DNS, racktables etc, switch ports, physical decom... [19:45:41] 06Operations, 10ops-eqiad, 10hardware-requests, 13Patch-For-Review: reclaim or decom: cp1043 + cp1044 - https://phabricator.wikimedia.org/T133614#2372769 (10Dzahn) @Robh @cmjohnson ^ [19:45:59] 06Operations, 10ops-eqiad: Rack/Setup 4 map servers in eqiad - https://phabricator.wikimedia.org/T135018#2372770 (10Cmjohnson) Installed 1001/3/4 w/out an issue 1002 gives me this error Loading debian-installer/amd64/linux... failed: No such file or directory boot: Loading debian-installer/amd64/linux... [19:46:22] thx mutante [19:47:01] 06Operations, 10ops-eqiad, 10hardware-requests, 13Patch-For-Review: reclaim or decom: cp1043 + cp1044 - https://phabricator.wikimedia.org/T133614#2372771 (10RobH) a:03Cmjohnson We don't pull mgmt dns out until it is unracked, since they are still technically accessible. Assigning to @cmjohnson for disk... [19:47:08] I'll disable the switch ports now [19:47:14] but won't remove them since they aren't unracked [19:48:04] remove what mgmt dns? [19:48:22] that task mentioned the steps left to do [19:48:37] listed mgmt dns as a step needed, I just commented we don't pull mgmt dns entries until the system is unracked. [19:48:51] mgmt can go... these are getting removed... no need for remote access [19:49:19] cmjohnson1: yes, but until they are unracked, they should stay valid [19:49:25] that's the lifecycle and the workflow [19:49:33] damn workflow ;-) [19:49:38] if it's racked and accessible, don't pull the entries (not sure why we would until it's unracked) [19:51:05] mutante: did you power them off? [19:51:09] yes [19:51:47] cool, I'm going to make a checklist template and post it on wikitech, linked off the lifecycle [19:51:55] sounds good [19:51:57] so when someone makes a decom ticket, we can paste in the checklist for ease of reference [19:55:18] 06Operations, 10ops-eqiad, 10hardware-requests, 13Patch-For-Review: reclaim or decom: cp1043 + cp1044 - https://phabricator.wikimedia.org/T133614#2372773 (10RobH) [19:55:44] 06Operations, 10Mail: not being able to send emails via Special:EmailUser - https://phabricator.wikimedia.org/T137337#2365463 (10Quiddity) @Mardetanha Would you be able to find out what domain the editors are using as their email provider? (e.g. Yahoo.com, Gmail.com, Mail.ru, etc. **Not** their personal addres...
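The cp1043/cp1044 decom steps being logged above map to roughly the commands below. This is a sketch to show the order of operations only: the hostname, FQDN and downtime length are placeholders, and the authoritative version is the lifecycle page plus the checklist robh says he is going to put on wikitech.

```bash
host=cp1043
fqdn=cp1043.eqiad.wmnet   # placeholder; use the host's real FQDN

# Downtime the host in icinga, confirm it is not pooled (pybal/confctl),
# then power it off.
sudo icinga-downtime -h "$host" -d 86400 -r "decom T133614"
ssh "$fqdn" 'sudo shutdown -h now'

# Make puppet and salt forget about it (run on the respective masters).
sudo puppet cert clean "$fqdn"
sudo salt-key -d "$fqdn"

# Production DNS entries go away via a gerrit change to the dns repo
# (293798 above); mgmt DNS, racktables and the switch ports are left in
# place until the box is physically unracked, per the lifecycle.
```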
[19:56:12] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.709 second response time [20:00:41] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 8.551 second response time [20:02:40] PROBLEM - puppet last run on suhail is CRITICAL: CRITICAL: puppet fail [20:13:15] (03PS1) 1020after4: use wmf/stable branch of arcanist and libphutil [puppet] - 10https://gerrit.wikimedia.org/r/293818 [20:20:08] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet has 2 failures [20:24:52] go puppet go. looking at mintaka... [20:25:10] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet has 2 failures [20:27:48] (03PS7) 10Paladox: git.wikimedia.org -> Diffusion redirects [puppet] - 10https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [20:30:09] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: puppet fail [20:30:20] RECOVERY - puppet last run on suhail is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:32:18] (03CR) 10Chad: [C: 04-1] "This is the wrong block to include it in, since it's on wikimedia.org and not mediawiki.org." [puppet] - 10https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [20:34:03] (03CR) 10Paladox: "@Chad hashar suggested that to mutante and me and him talked about that. We think it would be safer to do it with irdium since quicker dep" [puppet] - 10https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [20:34:21] (03CR) 10Paladox: "iridium" [puppet] - 10https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [20:35:08] RECOVERY - check_puppetrun on mintaka is OK: OK: Puppet is currently enabled, last run 218 seconds ago with 0 failures [20:39:32] (03PS8) 10Paladox: git.wikimedia.org -> Diffusion redirects [puppet] - 10https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [20:49:49] 06Operations: Investigate cp1044's strange Ganglia graphs - https://phabricator.wikimedia.org/T132859#2372862 (10Southparkfan) 05Open>03declined cp1044 has been decomissioned per T133614 [21:01:21] 06Operations, 10Mail: not being able to send emails via Special:EmailUser - https://phabricator.wikimedia.org/T137337#2372887 (10Mardetanha) >>! In T137337#2372774, @Quiddity wrote: > @Mardetanha Would you be able to find out what domain the editors are using as their email provider? (e.g. Yahoo.com, Gmail.com... 
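Since the gitblit -> diffusion rewrite rules keep coming up (Danny_B's draft on T137224, the extra mappings demon listed, and the git.wmflabs.org test instance Dzahn mentioned), a quick curl loop is one way to check that old gitblit-style URLs land where they should. The paths and expected targets below are illustrative only, not the agreed-upon mapping.

```bash
# Print the status line and Location header for a few legacy gitblit URLs
# against the labs test instance; swap in git.wikimedia.org once the rules
# are live in production.
base="http://git.wmflabs.org"

for path in "/" "/summary/mediawiki%2Fcore.git" "/tree/operations%2Fpuppet.git"; do
  echo "== $base$path"
  curl -sI "$base$path" | grep -iE '^(HTTP|Location:)'
done
```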
[21:32:16] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2372974 (10Paladox) @demon I think we should redirect git.wikimedia.org to phabricator.wikimedia.org/diffusion/ [21:33:03] (03PS2) 10Ori.livneh: Remove old config hack that disabled $wgResponsiveImages on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287818 (owner: 10Brion VIBBER) [21:33:11] (03CR) 10Ori.livneh: [C: 032] Remove old config hack that disabled $wgResponsiveImages on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287818 (owner: 10Brion VIBBER) [21:33:51] (03Merged) 10jenkins-bot: Remove old config hack that disabled $wgResponsiveImages on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287818 (owner: 10Brion VIBBER) [21:37:34] PROBLEM - Disk space on ms-be2012 is CRITICAL: DISK CRITICAL - free space: / 2121 MB (3% inode=96%) [21:44:34] (03PS2) 10Ori.livneh: admin/ori: make 'reqs' continuously update [puppet] - 10https://gerrit.wikimedia.org/r/285581 [21:45:21] (03PS3) 10Ori.livneh: admin/ori: make 'reqs' continuously update [puppet] - 10https://gerrit.wikimedia.org/r/285581 [21:45:27] (03CR) 10Ori.livneh: [C: 032 V: 032] admin/ori: make 'reqs' continuously update [puppet] - 10https://gerrit.wikimedia.org/r/285581 (owner: 10Ori.livneh) [21:52:54] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [21:58:04] !log ori@tin Synchronized wmf-config/mobile.php: I3d8155d7e14: Remove old config hack that disabled $wgResponsiveImages on mobile (duration: 00m 24s) [21:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:59:04] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
[22:03:04] (03CR) 10Ori.livneh: [C: 031] Enable base::firewall for osmium [puppet] - 10https://gerrit.wikimedia.org/r/293717 (owner: 10Muehlenhoff) [22:05:31] hello [22:05:36] hey [22:06:30] welcome to the matrix, yuvipanda M-gwicke [22:26:45] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.011 second response time [22:26:48] (03PS1) 10Jhobs: Prepare Wikidata descriptions on mobile for production rollout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293883 (https://phabricator.wikimedia.org/T127250) [22:28:53] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.016 second response time [22:34:53] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.754 second response time [22:36:53] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 7.685 second response time [22:50:25] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 3 failures [22:56:52] (03PS1) 10Yuvipanda: graphite: Do not hardcode hostname into checks [puppet] - 10https://gerrit.wikimedia.org/r/293886 [22:56:54] (03PS1) 10Mholloway: Stop restricting carrier tagging to mobile subdomains [puppet] - 10https://gerrit.wikimedia.org/r/293887 [22:56:56] ori ^ [22:57:21] (03CR) 10Ori.livneh: [C: 031] "tnx" [puppet] - 10https://gerrit.wikimedia.org/r/293886 (owner: 10Yuvipanda) [22:57:25] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2373135 (10Danny_B) >>! In T137224#2372974, @Paladox wrote: > @demon I think we should redirect git.wikimedia.org to... [22:57:43] (03PS2) 10Yuvipanda: graphite: Do not hardcode hostname into checks [puppet] - 10https://gerrit.wikimedia.org/r/293886 [22:57:56] (03CR) 10Yuvipanda: [C: 032 V: 032] graphite: Do not hardcode hostname into checks [puppet] - 10https://gerrit.wikimedia.org/r/293886 (owner: 10Yuvipanda) [22:58:01] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2373136 (10Danny_B) a:03Danny_B [22:59:51] interesting. 
it works over https but not http [23:04:41] (03PS1) 10Mholloway: Hygiene: Remove refs to ZeroRatedMobileAccess [puppet] - 10https://gerrit.wikimedia.org/r/293888 [23:05:26] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2373153 (10Danny_B) [23:12:58] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection timed out [23:13:33] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2373193 (10Danny_B) [23:15:28] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2016-09-06 22:27:00 +0000 (expires in 87 days) [23:16:37] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [23:31:13] 06Operations, 10Analytics, 10Traffic: Make upload.wikimedia.org cookieless - https://phabricator.wikimedia.org/T137609#2373208 (10ori) [23:40:11] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2373229 (10Danny_B) [23:51:17] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2373237 (10Paladox) [23:55:42] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2373243 (10Paladox) [23:56:20] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2361418 (10Paladox) [23:58:09] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2373246 (10Paladox) [23:59:43] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2361418 (10Paladox)