[00:01:07] 6operations, 10Traffic, 10Wikimedia-DNS: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1508645 (10CCogdill_WMF) This seems a bit strange to me as we send all fundraising email from @wikimedia.org... But in the interest of time, we will ac... [00:02:08] 6operations, 6Services: SCA: Move logs to /srv/ - https://phabricator.wikimedia.org/T107900#1508656 (10GWicke) Fwiw, I'm still in the no-local-logging camp. We have been using logstash exclusively for RESTBase and Parsoid for a long time and [for good reasons](https://wikitech.wikimedia.org/wiki/Incident_docum... [00:07:08] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy [00:07:09] RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy [00:07:09] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy [00:29:19] (03PS1) 10Mattflaschen: Disable two more wikis due to namespace conflicts. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229292 (https://phabricator.wikimedia.org/T107846) [00:36:32] (03PS7) 10Krinkle: mwgrep: Split results between public and private wikis [puppet] - 10https://gerrit.wikimedia.org/r/214037 (owner: 10Alex Monk) [00:36:50] Could an opsen deploy ^ soonish? [00:37:19] (03CR) 10Yuvipanda: [C: 032] mwgrep: Split results between public and private wikis [puppet] - 10https://gerrit.wikimedia.org/r/214037 (owner: 10Alex Monk) [00:38:20] Krinkle done [00:38:23] want me to run puppet on tin? [00:38:27] Yes please :) [00:38:31] Thank you Yuvi! 
[00:38:46] doing [00:40:23] Krinkle try now/ [00:46:37] YuviPanda: perfect [00:46:44] krinkle cool [00:46:59] Using it now to generate a few public repots [00:47:00] reports [00:48:09] 6operations, 6Discovery, 10Traffic, 10Wikidata, and 2 others: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1509043 (10JanZerebecki) I don't know of any deployed functionality that edits from www.wikidata.org to other Wikis. Maybe who knows what gadgets... [00:48:46] Krenair: cool [00:48:47] err [00:48:52] my irc client thinks krinkle isn't here [00:48:53] it's funny [00:49:07] * Krinkle re-assures you he is here [00:49:52] alright [00:49:57] krinkle still in SF? [00:50:04] krinkle if so we should try to do a meeting about CVN before you leave [00:50:12] I am. One more week exactly. [00:50:15] (next Tue) [00:50:15] ah nice [00:53:29] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:54:35] (03PS1) 10Dzahn: librenms - enable LDAP auth (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/229299 (https://phabricator.wikimedia.org/T107702) [00:55:29] PROBLEM - Restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:55:38] (03CR) 10jenkins-bot: [V: 04-1] librenms - enable LDAP auth (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/229299 (https://phabricator.wikimedia.org/T107702) (owner: 10Dzahn) [00:57:38] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy [00:57:39] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy [01:04:08] PROBLEM - Restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[01:06:09] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy [01:18:49] 6operations, 5Patch-For-Review: Configure librenms to use LDAP for authentication - https://phabricator.wikimedia.org/T107702#1509275 (10Dzahn) WIP patch above ^ ... did i get the options right for our LDAP server in eqiad? Also, see how the example has these 2 "unset" lines? Wondering how to write that when... [01:20:08] 6operations, 10RESTBase, 10Traffic, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1509301 (10Jdforrester-WMF) [01:21:10] 6operations, 10Traffic, 10Wikimedia-DNS: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1509309 (10BBlack) >>! In T107940#1508645, @CCogdill_WMF wrote: > This seems a bit strange to me as we send all fundraising email from @wikimedia.org.... [01:23:30] PROBLEM - Restbase root url on praseodymium is CRITICAL: Connection refused [01:24:27] 6operations, 7Mail: mail alias for benefactorevents - https://phabricator.wikimedia.org/T107977#1509344 (10Dzahn) 3NEW [01:24:50] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., BadStatusLine(,))) [01:24:59] 6operations, 10Traffic: Stop using LVS from varnishes - https://phabricator.wikimedia.org/T107956#1509358 (10BBlack) Is our depool-for-deploy pattern really that fast that it wants to churn through several servers per second serially? What exactly is happening during that window that requires a depooled outag... 
[01:25:38] RECOVERY - Restbase root url on praseodymium is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.028 second response time [01:26:56] 6operations, 10Traffic, 10Wikimedia-DNS: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1509367 (10Dzahn) >>! In T107940#1508645, @CCogdill_WMF wrote: > I don't believe this email address actually exists anywhere yet. Should we open a new... [01:29:09] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy [01:31:22] 6operations, 7Mail: mail alias for benefactorevents - https://phabricator.wikimedia.org/T107977#1509371 (10BBlack) This would at least require us to MX benefactorevents.wikimedia.org back to us as well, which we can't do because it's a CNAME to a cloud provider... [01:44:54] !log restarting elasticsearch of es1005 [01:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:48:57] (03PS2) 10Dzahn: librenms - enable LDAP auth (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/229299 (https://phabricator.wikimedia.org/T107702) [01:49:41] (03CR) 10jenkins-bot: [V: 04-1] librenms - enable LDAP auth (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/229299 (https://phabricator.wikimedia.org/T107702) (owner: 10Dzahn) [01:53:40] (03PS2) 10Dzahn: Increase wikidata dispatch lag critical to >300s [puppet] - 10https://gerrit.wikimedia.org/r/224541 (owner: 10JanZerebecki) [01:57:07] (03CR) 10Dzahn: [C: 032] Increase wikidata dispatch lag critical to >300s [puppet] - 10https://gerrit.wikimedia.org/r/224541 (owner: 10JanZerebecki) [01:58:06] (03CR) 10Dzahn: "[neon:~] $ /usr/lib/nagios/plugins/check_http -H www.wikidata.org -I www.wikidata.org -S -u "/w/api.php?action=query&meta=siteinfo&format=" [puppet] - 10https://gerrit.wikimedia.org/r/224541 (owner: 10JanZerebecki) [02:03:24] well that's just odd :S 1005 started at 1:51:37, detected master node at 1:51:42, then lost the master at 1:58:13.
found master again at 1:58:25 [02:03:45] but in the process of losing and refinding the master we went from 172 unassigned shards to 30 [02:22:38] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [02:23:52] ^ "normal" [02:24:40] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 497 bytes in 0.009 second response time [02:28:10] !log l10nupdate Synchronized php-1.26wmf16/cache/l10n: (no message) (duration: 06m 56s) [02:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:28:39] Yeah, dangerously normal [02:31:44] !log @tin LocalisationUpdate completed (1.26wmf16) at 2015-08-05 02:31:44+00:00 [02:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:57:53] !log l10nupdate Synchronized php-1.26wmf17/cache/l10n: (no message) (duration: 10m 30s) [02:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:59:22] (03PS1) 10GWicke: Don't require nodejs for restbase [puppet] - 10https://gerrit.wikimedia.org/r/229304 [03:04:08] !log @tin LocalisationUpdate completed (1.26wmf17) at 2015-08-05 03:04:08+00:00 [03:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:06:15] y u no add $LOGUSER l10nupdate? [03:09:40] !log restarted elasticsearch on elastic1006 for 1.7.1 upgrade [03:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:09:56] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 11 data above and 7 below the confidence bounds [03:16:07] (03PS1) 10GWicke: Disable RESTBase config.yaml deploys in puppet [puppet] - 10https://gerrit.wikimedia.org/r/229306 (https://phabricator.wikimedia.org/T107532) [03:17:54] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:20:28] (03PS1) 10BryanDavis: Send $LOGNAME rather than $LOGUSER with dologmsg messages [puppet] - 10https://gerrit.wikimedia.org/r/229307 [03:20:30] (03PS1) 10BryanDavis: l10nupdate: provide a log message for sync-dir [puppet] - 10https://gerrit.wikimedia.org/r/229308 [03:21:56] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy [03:33:38] (03CR) 10Legoktm: [C: 032] Disable two more wikis due to namespace conflicts. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229292 (https://phabricator.wikimedia.org/T107846) (owner: 10Mattflaschen) [03:33:44] (03Merged) 10jenkins-bot: Disable two more wikis due to namespace conflicts. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229292 (https://phabricator.wikimedia.org/T107846) (owner: 10Mattflaschen) [03:34:53] !log legoktm Synchronized wmf-config/InitialiseSettings.php: Disable two more wikis due to namespace conflicts - https://gerrit.wikimedia.org/r/229292 (duration: 00m 12s) [03:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:06:26] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 3 below the confidence bounds [04:13:16] PROBLEM - puppet last run on cp3004 is CRITICAL puppet fail [04:19:16] RECOVERY - Cassanda CQL query interface on restbase1009 is OK: TCP OK - 0.015 second response time on port 9042 [04:26:18] 6operations, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review, 7RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1509635 (10GWicke) And we are complete: ``` Datacenter: eqiad... [04:35:34] (03CR) 10Ori.livneh: "clear-profile doesn't do anything now; can you remove it entirely?" [puppet] - 10https://gerrit.wikimedia.org/r/229307 (owner: 10BryanDavis) [04:37:12] ori: how about as a follow up? 
seems like a strange thing to mix in to that change [04:37:28] sure, ok [04:38:26] RECOVERY - puppet last run on cp3004 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [04:38:41] (03CR) 10Ori.livneh: [C: 032] Send $LOGNAME rather than $LOGUSER with dologmsg messages [puppet] - 10https://gerrit.wikimedia.org/r/229307 (owner: 10BryanDavis) [04:41:20] ori: should I ensure=>absent too? [04:42:08] yeah, might as well. [04:43:36] (03PS1) 10BryanDavis: Remove clear-profile script and documentation [puppet] - 10https://gerrit.wikimedia.org/r/229322 [04:50:06] (03Abandoned) 10Brion VIBBER: Enable 240p Theora and WebM video transcodes for low-bandwidth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226445 (https://phabricator.wikimedia.org/T104063) (owner: 10Brion VIBBER) [04:50:24] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [04:50:47] 6operations, 6Discovery, 10Traffic, 10Wikidata, and 2 others: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1509659 (10JeroenDeDauw) > including the REST API at /api/rest_v1/ What REST API are you talking about? [04:56:02] !log restarted elasticsearch on elastic1007 for 1.7.1 upgrade [04:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:59:16] 6operations, 7HHVM, 7Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#1509672 (10JeroenDeDauw) So it is now possible to deploy code with PHP 5.6 features? 
[05:13:59] 6operations, 6Discovery, 10Traffic, 10Wikidata, and 2 others: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1509682 (10GWicke) @jeroendedauw: https://en.wikipedia.org/api/rest_v1/?doc [05:31:52] 6operations, 10RESTBase, 6Services, 10Traffic, 5Patch-For-Review: Provide an API listing at /api/ - https://phabricator.wikimedia.org/T107086#1509702 (10Spage) >>! In T107086#1497436, @GWicke wrote: > @spage, do you have the right to edit protected pages on meta? Nope. This is all very cool. > Set up a... [05:35:12] 6operations, 7HHVM, 7Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#1509707 (10MaxSem) I strongly suspect that we will bump our requirements only to 5.4 - however some folks out there really want to switch to Hack altogether. [05:40:58] 6operations, 7HHVM, 7Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#1509711 (10JeroenDeDauw) I am not asking about what versions of PHP that MediaWiki ought to work with. This is purely about what the restrictions are for PHP co... [05:42:21] 6operations, 6Discovery, 3Discovery-Cirrus-Sprint, 7Elasticsearch, 5Patch-For-Review: Use fixed ports for elasticsearch - https://phabricator.wikimedia.org/T107278#1509713 (10Dzahn) ``` root@elastic1001:~# netstat -tulpen | grep LISTEN | grep java tcp6 0 0 :::9300 :::*... [05:44:32] 6operations, 6Discovery, 3Discovery-Cirrus-Sprint, 7Elasticsearch: Use fixed ports for elasticsearch - https://phabricator.wikimedia.org/T107278#1509714 (10Dzahn) [05:46:26] 6operations, 7HHVM, 7Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#1509717 (10Legoktm) >>! In T86081#1509711, @JeroenDeDauw wrote: > So, PHP 5.6 fine or not? No. 
[05:46:59] 6operations, 6Discovery, 3Discovery-Cirrus-Sprint, 7Elasticsearch: Use fixed ports for elasticsearch - https://phabricator.wikimedia.org/T107278#1509718 (10Dzahn) 5Open>3Resolved a:3Dzahn Same on elastic1002. Looks resolved to me. tcp6 0 0 :::9300 :::* L... [05:48:25] 6operations, 6Discovery, 3Discovery-Cirrus-Sprint, 7Elasticsearch: Use fixed ports for elasticsearch - https://phabricator.wikimedia.org/T107278#1509722 (10Dzahn) [05:48:27] 6operations, 5Patch-For-Review: Ferm rules for elasticsearch - https://phabricator.wikimedia.org/T104962#1509721 (10Dzahn) [05:52:27] 6operations: integration.wikimedia.org redirect behavior is incorrect - https://phabricator.wikimedia.org/T84060#1509733 (10Dzahn) [05:54:21] 6operations, 7Icinga: Make nagios check_disk check for inode usage as well - https://phabricator.wikimedia.org/T84171#1509739 (10Dzahn) [05:56:03] 6operations: Setup install server in codfw - tftp done, but not apt and other install services - https://phabricator.wikimedia.org/T84380#1509744 (10Dzahn) [06:00:08] 6operations: Turn on Cirrus replicas for labswiki (wikitech) - https://phabricator.wikimedia.org/T83760#1509756 (10Dzahn) [06:03:13] 6operations: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#1509759 (10Dzahn) [06:03:22] 6operations, 10Gitblit-Deprecate: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#1509763 (10Dzahn) [06:03:46] 6operations, 10Gitblit-Deprecate: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#917458 (10Dzahn) see https://phabricator.wikimedia.org/tag/gitblit-deprecate/ [06:04:17] 6operations, 10Gitblit-Deprecate: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#1509766 (10Dzahn) 5Open>3declined a:3Dzahn [06:05:21] 6operations, 10Gitblit-Deprecate: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#917458 (10Dzahn) @joe what do you think , was "declined" justified since there is the workboard to 
deprecate gitblit or do you want it reopened until the actual switch? [06:06:04] 6operations, 10Gitblit-Deprecate, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#1509769 (10Dzahn) [06:06:16] 6operations, 10Gitblit-Deprecate, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#1509770 (10Dzahn) 5declined>3Open [06:06:35] 6operations, 10Gitblit-Deprecate, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#917458 (10Dzahn) a:5Dzahn>3None [06:16:44] 6operations, 10Wikimedia-Logstash: Import logstash 1.5.3 into apt.wm.o - https://phabricator.wikimedia.org/T107916#1509778 (10MoritzMuehlenhoff) a:3MoritzMuehlenhoff [06:21:03] 6operations, 6Discovery, 3Discovery-Cirrus-Sprint, 7Elasticsearch: Use fixed ports for elasticsearch - https://phabricator.wikimedia.org/T107278#1509781 (10MoritzMuehlenhoff) Looks good (although Elasticsearch used this ports before as well). It's only that with the new configuration it won't switch to 920... [06:26:00] 6operations, 10ops-codfw, 7network: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#1509783 (10Dzahn) @papaul since you resolved the blocking task to connect the Apple airport, do you know what else we are missing here to also call this task resolved? [06:26:28] 6operations, 10ops-codfw, 7network: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#1509785 (10Dzahn) a:3Papaul [06:26:38] !log es1.7.1: upgrade elastic1008 [06:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:27:08] !log @tin ResourceLoader cache refresh completed at Wed Aug 5 06:27:08 UTC 2015 (duration 27m 7s) [06:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:28:15] <_joe_> dcausse: \o/ [06:28:47] _joe_: hi! [06:29:11] <_joe_> is the upgrade going faster now, with the index freezing? 
[06:29:24] 6operations, 10ops-codfw, 7network: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#1509789 (10Papaul) The wifi is not setup yet. [06:29:40] !log finish OSC gerrit 228756 s5 wb_items_per_site.ips_site_page [06:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:29:47] _joe_: nope, not as fast as we thought, so we had to resume writes [06:30:00] <_joe_> oh, that's a shame [06:30:13] <_joe_> let's blame manybubbles [06:30:22] yep :) [06:30:26] <_joe_> manybubbles: your software sucks!!!1! [06:30:28] <_joe_> :) [06:31:15] 7Puppet, 6operations, 5Interdatacenter-IPsec, 5Patch-For-Review: Puppet function: ipresolve: throw an error if lookup fails, refactor into wmflib - https://phabricator.wikimedia.org/T99833#1509791 (10Dzahn) a:3yuvipanda @yuvipanda looks like you resolved it? [06:31:54] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures [06:32:04] PROBLEM - puppet last run on mw1220 is CRITICAL Puppet has 2 failures [06:32:05] PROBLEM - puppet last run on cp3008 is CRITICAL Puppet has 1 failures [06:32:06] PROBLEM - puppet last run on mw2018 is CRITICAL Puppet has 1 failures [06:32:21] 7Puppet, 6operations, 5Interdatacenter-IPsec, 5Patch-For-Review: Puppet function: ipresolve: throw an error if lookup fails, refactor into wmflib - https://phabricator.wikimedia.org/T99833#1509794 (10yuvipanda) 5Open>3Resolved Think so! [06:32:35] PROBLEM - puppet last run on mw1135 is CRITICAL Puppet has 1 failures [06:32:45] PROBLEM - puppet last run on mw1158 is CRITICAL Puppet has 1 failures [06:32:54] PROBLEM - puppet last run on db1056 is CRITICAL Puppet has 2 failures [06:32:55] PROBLEM - puppet last run on mw1120 is CRITICAL Puppet has 1 failures [06:33:04] PROBLEM - puppet last run on cp1068 is CRITICAL Puppet has 1 failures [06:33:05] PROBLEM - puppet last run on mw1110 is CRITICAL Puppet has 1 failures [06:35:42] _joe_, he left us to make it better!
[06:35:50] 6operations, 10Gitblit-Deprecate, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#1509796 (10Joe) I think this ticket should stay open, to testify how much harm and inertia some abandonware might cause. We should have had the courage to admit that # Gitb... [06:36:33] <_joe_> MaxSem: and we will still make fun of him. The world is unfair! [06:55:44] RECOVERY - puppet last run on cp1068 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:56:45] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:56:54] RECOVERY - puppet last run on mw1220 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:56] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:57:25] RECOVERY - puppet last run on mw1135 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:57:35] RECOVERY - puppet last run on mw1158 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:44] RECOVERY - puppet last run on db1056 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:57:45] RECOVERY - puppet last run on mw1120 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:57:55] RECOVERY - puppet last run on mw1110 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:46] (03PS2) 10EBernhardson: Disable dynamic scripting in Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/224651 (owner: 10Manybubbles) [06:59:05] RECOVERY - puppet last run on mw2018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:00:11] (03CR) 10EBernhardson: [C: 031] "The requisite patch has been merged and deployed. We should be good to turn this off."
[puppet] - 10https://gerrit.wikimedia.org/r/224651 (owner: 10Manybubbles) [07:06:53] (03PS1) 10Jcrespo: depool db1056 for maintenance, db1064 set to 100% load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229332 [07:08:50] (03CR) 10Jcrespo: [C: 032] depool db1056 for maintenance, db1064 set to 100% load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229332 (owner: 10Jcrespo) [07:12:52] !log jynus Synchronized wmf-config/db-eqiad.php: depool db1056 for maintenance, db1064 set to 100% (duration: 00m 12s) [07:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:13:10] 6operations, 10Wikimedia-Logstash: Import logstash 1.5.3 into apt.wm.o - https://phabricator.wikimedia.org/T107916#1509826 (10MoritzMuehlenhoff) logstash 1.5.3-1 has been imported for jessie-wikimedia on apt.wikimedia.org [07:13:17] 6operations, 10Wikimedia-Logstash: Import logstash 1.5.3 into apt.wm.o - https://phabricator.wikimedia.org/T107916#1509827 (10MoritzMuehlenhoff) 5Open>3Resolved [07:33:06] https://phabricator.wikimedia.org/T107995 [07:33:38] flow broken in 1.26wmf17 [07:46:25] !log es1.7.1: upgrade elastic1009 [07:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:27:34] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:31:54] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy [08:33:25] PROBLEM - mediawiki-installation DSH group on mw1061 is CRITICAL: Host mw1061 is not in mediawiki-installation dsh group [08:35:33] (03CR) 10Alexandros Kosiaris: [C: 04-1] "This was discussed during the ops meeting on Monday. 
It was turned down as an approach" [puppet] - 10https://gerrit.wikimedia.org/r/229306 (https://phabricator.wikimedia.org/T107532) (owner: 10GWicke) [08:37:26] (03CR) 10Bugreporter: "Several suggestions" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229197 (https://phabricator.wikimedia.org/T107927) (owner: 10Aude) [08:37:42] (03CR) 10Alexandros Kosiaris: [C: 04-1] Don't require nodejs for restbase (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/229304 (owner: 10GWicke) [08:53:04] thedj: looking into it [08:53:30] (03PS1) 10Muehlenhoff: The generation of the openjdk source packages requires wdiff, so add it to the dependencies installed by package_builder. [puppet] - 10https://gerrit.wikimedia.org/r/229338 [08:54:57] works for me... [09:04:19] 6operations, 10RESTBase-Cassandra: Update JDK 8 package in backports repo - https://phabricator.wikimedia.org/T104887#1510028 (10MoritzMuehlenhoff) a:3MoritzMuehlenhoff [09:06:37] 6operations, 10RESTBase-Cassandra: Update JDK 8 package in backports repo - https://phabricator.wikimedia.org/T104887#1510048 (10MoritzMuehlenhoff) An openjdk-8 is running, it will also be uploaded to jessie-backports. Filippo and I will take care of keeping it updated there for the quarterly security releases... [09:08:38] !log es1.7.1: upgrade elastic1010 [09:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:18:10] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1510076 (10Tau) {F459729} Have some problems with running this patch... 
[09:20:52] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering, 10MediaWiki-extensions-ContentTranslation, and 2 others: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1510079 (10akosiaris) Stale processes left behind by apertium-apy is c... [09:28:31] (03CR) 10Alexandros Kosiaris: [C: 032] The generation of the openjdk source packages requires wdiff, so add it to the dependencies installed by package_builder. [puppet] - 10https://gerrit.wikimedia.org/r/229338 (owner: 10Muehlenhoff) [09:37:05] PROBLEM - Restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:38:42] (03CR) 10Alexandros Kosiaris: [C: 04-1] "various comments around" (036 comments) [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/229193 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [09:39:04] RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy [09:44:39] 6operations, 7Database: db1002-db1007 - decom or repurpose? - https://phabricator.wikimedia.org/T103005#1510129 (10jcrespo) [09:44:40] 6operations, 10ops-eqiad, 7Database, 5Patch-For-Review: Remove db1002-db1007 from production - https://phabricator.wikimedia.org/T105768#1510130 (10jcrespo) [09:45:04] 6operations, 7Database: db1002-db1007 - decom or repurpose? - https://phabricator.wikimedia.org/T103005#1380026 (10jcrespo) [09:45:06] 6operations, 10ops-eqiad, 7Database, 5Patch-For-Review: Remove db1002-db1007 from production - https://phabricator.wikimedia.org/T105768#1451207 (10jcrespo) [10:01:33] 6operations, 7Database: review eqiad database server quantities / warranties / service(s) - https://phabricator.wikimedia.org/T103936#1510179 (10jcrespo) So the plan is, for critical databases with immediate needs, use replacements parts coming from decommissioning db1002 to db1007, maybe db1035 too. The long... 
[10:03:07] 6operations, 10RESTBase-Cassandra: Update JDK 8 package in backports repo - https://phabricator.wikimedia.org/T104887#1510181 (10mobrovac) Many thanks @MoritzMuehlenhoff and @fgiunchedi ! [10:03:37] 6operations, 7discovery-system: Remove etcd1001 from the etcd cluster, decommission them. - https://phabricator.wikimedia.org/T108010#1510189 (10Joe) 3NEW a:3Joe [10:03:39] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1510197 (10akosiaris) Hello, So * /var/log/kartotherian/* - accessible * /var/log/cassandra/* - accessible * /var/log/postgres/* - not ac... [10:05:49] (03PS1) 10Jakob: Add php5-curl package to Phragile. [puppet] - 10https://gerrit.wikimedia.org/r/229355 (https://phabricator.wikimedia.org/T101235) [10:09:02] 6operations, 6Services: SCA: Move logs to /srv/ - https://phabricator.wikimedia.org/T107900#1510206 (10akosiaris) Yeah, we can log to /srv/log instead and possibly create a symlink from /var/log/. From what I see it is already configurable in service::configuration but will need some extra love. I'll...
[10:30:10] 6operations, 6Labs, 10wikitech.wikimedia.org: Turn on Cirrus replicas for labswiki (wikitech) - https://phabricator.wikimedia.org/T83760#1510257 (10Krenair) [10:30:11] (03PS1) 10KartikMistry: Limit number of APY instances to 8 [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/229359 [10:34:26] PROBLEM - Restbase root url on restbase1009 is CRITICAL: Connection refused [10:36:01] (03PS2) 10Filippo Giunchedi: l10nupdate: provide a log message for sync-dir [puppet] - 10https://gerrit.wikimedia.org/r/229308 (owner: 10BryanDavis) [10:36:07] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] l10nupdate: provide a log message for sync-dir [puppet] - 10https://gerrit.wikimedia.org/r/229308 (owner: 10BryanDavis) [10:36:14] !log applying schema change for s4 on codfw, some lag expected [10:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:37:07] 6operations, 6Labs, 10wikitech.wikimedia.org: Turn on Cirrus replicas for labswiki (wikitech) - https://phabricator.wikimedia.org/T83760#1510263 (10Krenair) @manybubbles, @ottomata: It doesn't look like the other wikis set this... Since wikitech is now part of the normal system, is there still anything to do... 
[10:43:37] !log upgrading asw-b-codfw to newer junos [10:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:46:15] 6operations, 10RESTBase-Cassandra: upgrade RESTBase cluster to Cassandra 2.1.8 - https://phabricator.wikimedia.org/T107949#1510274 (10fgiunchedi) proposed plan: * upgrade cassandra to 2.1.8 via deb upgrades on the staging cluster * benchmark/stresstest * upload package to apt.w.o and upgrade production cluster [10:47:15] RECOVERY - Restbase root url on restbase1009 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.018 second response time [10:52:11] !log pool restbase100[789] in pybal [10:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:57:41] (03CR) 10Filippo Giunchedi: [C: 04-1] "grafana v2 ships a backend server as opposed to everything client side, I think we should start with a jessie VM" [puppet] - 10https://gerrit.wikimedia.org/r/229132 (https://phabricator.wikimedia.org/T107832) (owner: 10Dzahn) [10:58:48] 6operations, 10RESTBase, 6Services, 7RESTBase-architecture: Update restbase100[1-6] to the 3.19 kernel - https://phabricator.wikimedia.org/T102234#1510305 (10fgiunchedi) [10:58:51] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1510306 (10fgiunchedi) [10:58:54] 6operations, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review, 7RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1510303 (10fgiunchedi) 5Open>3Resolved >>! In T102015#1509635, @GWicke wrote: > ..aand we are complete: > > ``` > Datacenter: eqiad... 
[11:01:47] (03CR) 10Alexandros Kosiaris: [C: 04-2] "This is going to spawn way too many process in labs instances/vagrant VMs, testing environments, possibly causing problems and hardcoding " [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/229359 (owner: 10KartikMistry) [11:01:56] !log depool restbase1009, investigating healthcheck returning 500s [11:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:14:51] (03PS1) 10Jcrespo: Repool db1056, depool db1059 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229367 [11:15:42] (03CR) 10Jcrespo: [C: 032] Repool db1056, depool db1059 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229367 (owner: 10Jcrespo) [11:17:48] !log jynus Synchronized wmf-config/db-eqiad.php: repool db1056, depool db1059 (duration: 00m 12s) [11:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:26:22] switch is rebooting [11:26:36] PROBLEM - Host mw2141 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:38] don't be alarmed [11:26:41] 6operations, 6Services, 10hardware-requests: Assign wmf4541,wmf4543 for service cluster expansion as scb1001, scb1002 - https://phabricator.wikimedia.org/T107287#1510377 (10akosiaris) >>! In T107287#1506700, @mobrovac wrote: >>>! In T107287#1506524, @akosiaris wrote: >> * The SCA cluster at this point has re... 
[11:26:45] PROBLEM - Host mw2121 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:45] PROBLEM - Host mw2089 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:45] PROBLEM - Host mw2137 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:45] PROBLEM - Host mw2097 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:45] PROBLEM - Host mw2138 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:46] PROBLEM - Host mw2134 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:46] PROBLEM - Host mw2117 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:47] PROBLEM - Host mw2131 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:47] PROBLEM - Host labstore2001 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:48] PROBLEM - Host mw2104 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:48] PROBLEM - Host cp2010 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:49] PROBLEM - Host mw2112 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:49] PROBLEM - Host mw2132 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:50] PROBLEM - Host mw2120 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:05] PROBLEM - Host mw2127 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:06] PROBLEM - Host mw2118 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:06] PROBLEM - Host mw2103 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:06] PROBLEM - Host db2016 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:06] PROBLEM - Host mw2094 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:06] PROBLEM - Host mw2091 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:07] PROBLEM - Host mw2092 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:07] PROBLEM - Host ms-be2008 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:08] PROBLEM - Host mw2083 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:08] PROBLEM - Host mw2086 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:16] PROBLEM - Host db2018 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:16] PROBLEM - Host mw2107 is DOWN: PING CRITICAL - Packet 
loss = 100% [11:27:16] PROBLEM - Host mw2124 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:16] PROBLEM - Host mw2119 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:16] PROBLEM - Host mw2095 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:16] PROBLEM - Host mw2128 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:17] PROBLEM - Host mw2140 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:17] PROBLEM - Host mw2125 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:17] PROBLEM - Host mw2096 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:18] PROBLEM - Host mw2115 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:18] PROBLEM - Host mw2084 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:19] PROBLEM - Host wtp2002 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:24] PROBLEM - Host cp2012 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:25] PROBLEM - Host mw2122 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:25] PROBLEM - Host cp2007 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:25] PROBLEM - Host mw2113 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:34] PROBLEM - Host 2620:0:860:2:208:80:153:42 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:2:208:80:153:42 [11:27:35] PROBLEM - Host mw2146 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:35] PROBLEM - Host mw2145 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:35] PROBLEM - Host mw2136 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:35] PROBLEM - Host mw2144 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:35] PROBLEM - Host mw2106 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:35] PROBLEM - Host mw2101 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:36] PROBLEM - Host mw2130 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:36] PROBLEM - Host mw2114 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:37] PROBLEM - Host mw2109 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:37] PROBLEM - Host mw2108 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:41] (03PS2) 
10Filippo Giunchedi: Remove clear-profile script and documentation [puppet] - 10https://gerrit.wikimedia.org/r/229322 (owner: 10BryanDavis) [11:27:45] PROBLEM - Host mw2123 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:45] PROBLEM - Host mw2100 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:45] PROBLEM - Host mw2129 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:46] PROBLEM - Host mw2102 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:46] PROBLEM - Host db2029 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:46] PROBLEM - Host mw2105 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:46] PROBLEM - Host mw2093 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:47] PROBLEM - Host mw2111 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:47] PROBLEM - Host wtp2003 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:48] PROBLEM - Host db2030 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:48] PROBLEM - Host wtp2010 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:48] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Remove clear-profile script and documentation [puppet] - 10https://gerrit.wikimedia.org/r/229322 (owner: 10BryanDavis) [11:27:49] PROBLEM - Host mw2126 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:49] PROBLEM - Host wtp2006 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:50] PROBLEM - Host mw2090 is DOWN: PING CRITICAL - Packet loss = 100% [11:28:05] PROBLEM - Host mw2143 is DOWN: PING CRITICAL - Packet loss = 100% [11:28:05] PROBLEM - Host mw2135 is DOWN: PING CRITICAL - Packet loss = 100% [11:28:08] wth ? 
[11:28:14] PROBLEM - Host pollux is DOWN: CRITICAL - Network Unreachable (208.80.153.43) [11:28:15] PROBLEM - Host mw2085 is DOWN: PING CRITICAL - Packet loss = 100% [11:28:15] PROBLEM - Host mw2082 is DOWN: PING CRITICAL - Packet loss = 100% [11:28:15] PROBLEM - Host mc2012 is DOWN: PING CRITICAL - Packet loss = 100% [11:28:25] PROBLEM - Host nembus is DOWN: CRITICAL - Network Unreachable (208.80.153.44) [11:28:26] PROBLEM - Host rdb2004 is DOWN: PING CRITICAL - Packet loss = 100% [11:28:35] PROBLEM - Host achernar is DOWN: CRITICAL - Network Unreachable (208.80.153.42) [11:28:44] PROBLEM - Host es2008 is DOWN: PING CRITICAL - Packet loss = 100% [11:28:54] PROBLEM - Host rdb2003 is DOWN: PING CRITICAL - Packet loss = 100% [11:28:54] PROBLEM - Host db2019 is DOWN: PING CRITICAL - Packet loss = 100% [11:28:54] PROBLEM - Host db2028 is DOWN: PING CRITICAL - Packet loss = 100% [11:28:54] PROBLEM - Host wtp2008 is DOWN: PING CRITICAL - Packet loss = 100% [11:28:54] PROBLEM - Host subra is DOWN: PING CRITICAL - Packet loss = 100% [11:29:14] codfw is down [11:29:22] no it's not [11:29:25] 14:26 < paravoid> switch is rebooting [11:29:25] 14:26 < icinga-wm> PROBLEM - Host mw2141 is DOWN: PING CRITICAL - Packet loss = 100% [11:29:29] 14:26 < paravoid> don't be alarmed [11:30:00] I only executed the alter table on codfw on localhost by chance, good for me [11:30:01] well played on 'alarmed' [11:30:12] I logged it before too [11:30:35] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: host 208.80.153.192, interfaces up: 106, down: 2, dormant: 0, excluded: 0, unused: 0; ae2: down - Core: asw-b-codfw:ae1; et-0/0/1: down - asw-b-codfw:et-2/0/51 {#10703} [40Gbps DF] [11:30:36] PROBLEM - configured eth on lvs2003 is CRITICAL: eth1 reporting no carrier.
[11:30:37] a ok [11:30:44] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: host 208.80.153.193, interfaces up: 102, down: 2, dormant: 0, excluded: 0, unused: 0; ae2: down - Core: asw-b-codfw:ae2; et-0/0/1: down - asw-b-codfw:et-7/0/52 {#10707} [40Gbps DF] [11:30:46] PROBLEM - IPsec on cp4017 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6 [11:30:46] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [11:30:54] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [11:30:54] PROBLEM - IPsec on cp3014 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6 [11:30:56] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6 [11:31:05] PROBLEM - HHVM rendering on mw2179 is CRITICAL - Socket timeout after 10 seconds [11:31:25] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2009_v4, cp2009_v6 [11:31:25] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [11:31:25] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [11:31:25] PROBLEM - IPsec on cp3018 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2009_v4, cp2009_v6 [11:31:26] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [11:31:26] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [11:31:26] PROBLEM - IPsec on cp3019 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2012_v4, cp2012_v6 [11:31:26] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2008_v4,
cp2008_v6, cp2011_v4, cp2011_v6 [11:31:45] PROBLEM - configured eth on lvs2001 is CRITICAL: eth1 reporting no carrier. [11:31:46] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [11:31:46] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2009_v4, cp2009_v6 [11:31:46] PROBLEM - IPsec on cp4004 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2012_v4, cp2012_v6 [11:31:46] PROBLEM - IPsec on cp3021 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2012_v4, cp2012_v6 [11:31:46] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [11:31:46] PROBLEM - IPsec on cp3005 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6 [11:31:46] PROBLEM - IPsec on cp3031 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6 [11:31:47] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6 [11:31:47] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [11:31:48] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6 [11:31:54] PROBLEM - configured eth on lvs2002 is CRITICAL: eth1 reporting no carrier. 
[11:32:06] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL - Plugin timed out while executing system call [11:32:15] PROBLEM - IPsec on cp4001 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2012_v4, cp2012_v6 [11:32:15] PROBLEM - IPsec on cp4003 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2012_v4, cp2012_v6 [11:32:15] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [11:32:15] PROBLEM - IPsec on cp4008 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6 [11:32:15] PROBLEM - IPsec on cp4009 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6 [11:32:15] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [11:32:15] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [11:32:16] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [11:32:16] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [11:32:17] PROBLEM - IPsec on cp3012 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6 [11:32:24] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [11:32:24] PROBLEM - IPsec on cp4010 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6 [11:32:24] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6 [11:32:24] PROBLEM - IPsec on cp3020 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2012_v4, cp2012_v6 [11:32:24] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - 
ok: 38 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [11:32:24] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [11:32:24] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [11:32:34] PROBLEM - IPsec on cp4002 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2012_v4, cp2012_v6 [11:32:34] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6 [11:32:34] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2009_v4, cp2009_v6 [11:32:34] PROBLEM - IPsec on cp4016 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6 [11:32:35] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [11:32:35] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2009_v4, cp2009_v6 [11:32:35] PROBLEM - IPsec on cp3017 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2009_v4, cp2009_v6 [11:32:35] PROBLEM - IPsec on cp3016 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2009_v4, cp2009_v6 [11:32:36] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6 [11:32:36] PROBLEM - IPsec on cp3013 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6 [11:32:37] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6 [11:32:37] PROBLEM - IPsec on cp3006 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6 [11:32:38] PROBLEM - IPsec on cp3003 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6 [11:32:38] PROBLEM - IPsec on cp3004 
is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6 [11:32:44] 6operations, 6Services, 10hardware-requests: Assign WMF5842, WMF5843 for service cluster expansion as scb1001, scb1002 - https://phabricator.wikimedia.org/T107287#1510393 (10akosiaris) [11:32:53] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, good to merge anytime I guess (?)" [puppet] - 10https://gerrit.wikimedia.org/r/224651 (owner: 10Manybubbles) [11:33:05] PROBLEM - IPsec on cp3015 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2009_v4, cp2009_v6 [11:33:15] PROBLEM - HHVM rendering on mw2195 is CRITICAL - Socket timeout after 10 seconds [11:33:55] PROBLEM - HHVM rendering on mw2058 is CRITICAL - Socket timeout after 10 seconds [11:34:05] RECOVERY - configured eth on lvs2002 is OK - interfaces up [11:34:15] RECOVERY - Host mw2125 is UP: PING WARNING - Packet loss = 80%, RTA = 51.75 ms [11:34:15] RECOVERY - Host mw2100 is UP: PING WARNING - Packet loss = 80%, RTA = 51.76 ms [11:34:15] RECOVERY - Host mw2113 is UP: PING WARNING - Packet loss = 80%, RTA = 51.74 ms [11:34:15] RECOVERY - Host mw2102 is UP: PING WARNING - Packet loss = 80%, RTA = 52.03 ms [11:34:15] RECOVERY - Host mw2129 is UP: PING WARNING - Packet loss = 73%, RTA = 52.82 ms [11:34:15] RECOVERY - Host mw2104 is UP: PING WARNING - Packet loss = 73%, RTA = 52.07 ms [11:34:15] RECOVERY - Host db2029 is UP: PING WARNING - Packet loss = 73%, RTA = 52.09 ms [11:34:16] RECOVERY - Host mw2105 is UP: PING WARNING - Packet loss = 73%, RTA = 53.17 ms [11:34:16] RECOVERY - Host mw2122 is UP: PING WARNING - Packet loss = 73%, RTA = 51.75 ms [11:34:17] RECOVERY - Host mw2093 is UP: PING WARNING - Packet loss = 73%, RTA = 51.96 ms [11:34:18] RECOVERY - Host db2030 is UP: PING WARNING - Packet loss = 73%, RTA = 51.90 ms [11:34:18] RECOVERY - Host wtp2003 is UP: PING WARNING - Packet loss = 73%, RTA = 52.04 ms [11:34:30] (03CR) 10Filippo Giunchedi: [C: 04-1] "doesn't look like this is related with the investigation in T106619" [puppet] -
10https://gerrit.wikimedia.org/r/227335 (https://phabricator.wikimedia.org/T106619) (owner: 10GWicke) [11:35:38] RECOVERY - Host 208.80.153.42 is UP: PING OK - Packet loss = 0%, RTA = 54.06 ms [11:35:45] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - 16 ESP OK [11:35:46] RECOVERY - IPsec on cp3018 is OK: Strongswan OK - 16 ESP OK [11:35:46] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 42 ESP OK [11:35:46] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 42 ESP OK [11:35:46] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 42 ESP OK [11:35:46] RECOVERY - IPsec on cp3019 is OK: Strongswan OK - 16 ESP OK [11:35:46] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 42 ESP OK [11:35:47] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 42 ESP OK [11:36:04] RECOVERY - HHVM rendering on mw2058 is OK: HTTP OK: HTTP/1.1 200 OK - 67223 bytes in 0.553 second response time [11:36:04] RECOVERY - configured eth on lvs2001 is OK - interfaces up [11:36:05] RECOVERY - Host 2620:0:860:2:208:80:153:42 is UP: PING OK - Packet loss = 0%, RTA = 52.04 ms [11:36:05] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 42 ESP OK [11:36:06] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - 16 ESP OK [11:36:06] RECOVERY - IPsec on cp4004 is OK: Strongswan OK - 16 ESP OK [11:36:06] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 42 ESP OK [11:36:06] RECOVERY - IPsec on cp3021 is OK: Strongswan OK - 16 ESP OK [11:36:06] RECOVERY - IPsec on cp3005 is OK: Strongswan OK - 32 ESP OK [11:36:07] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 32 ESP OK [11:36:07] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 42 ESP OK [11:36:08] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 32 ESP OK [11:36:08] RECOVERY - IPsec on cp3031 is OK: Strongswan OK - 32 ESP OK [11:38:44] PROBLEM - puppet last run on wtp2003 is CRITICAL puppet fail [11:38:44] PROBLEM - puppet last run on db2030 is CRITICAL puppet fail [11:38:45] PROBLEM - puppet last run on db2028 is CRITICAL puppet fail [11:38:45] PROBLEM -
puppet last run on mw2121 is CRITICAL puppet fail [11:38:45] PROBLEM - puppet last run on rdb2004 is CRITICAL puppet fail [11:39:04] PROBLEM - puppet last run on graphite2001 is CRITICAL puppet fail [11:39:35] PROBLEM - puppet last run on mw2125 is CRITICAL puppet fail [11:39:35] PROBLEM - puppet last run on mw2138 is CRITICAL puppet fail [11:39:35] PROBLEM - puppet last run on mw2124 is CRITICAL puppet fail [11:39:56] PROBLEM - puppet last run on lvs2006 is CRITICAL puppet fail [11:40:05] PROBLEM - puppet last run on mw2136 is CRITICAL Puppet has 37 failures [11:40:05] PROBLEM - puppet last run on db2016 is CRITICAL puppet fail [11:40:15] PROBLEM - puppet last run on wtp2006 is CRITICAL puppet fail [11:40:16] PROBLEM - puppet last run on eventlog2001 is CRITICAL Puppet has 26 failures [11:40:16] PROBLEM - puppet last run on mw2120 is CRITICAL puppet fail [11:40:16] PROBLEM - puppet last run on mw2088 is CRITICAL puppet fail [11:40:16] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 1 failures [11:40:16] PROBLEM - puppet last run on mw2108 is CRITICAL puppet fail [11:40:25] PROBLEM - puppet last run on mw2122 is CRITICAL puppet fail [11:40:25] PROBLEM - puppet last run on mw2146 is CRITICAL puppet fail [11:40:25] PROBLEM - puppet last run on mw2095 is CRITICAL puppet fail [11:40:25] PROBLEM - puppet last run on wtp2004 is CRITICAL puppet fail [11:40:25] PROBLEM - puppet last run on mc2008 is CRITICAL puppet fail [11:40:26] PROBLEM - puppet last run on db2019 is CRITICAL puppet fail [11:40:26] PROBLEM - puppet last run on ganeti2006 is CRITICAL puppet fail [11:40:45] PROBLEM - puppet last run on cp2009 is CRITICAL puppet fail [11:40:55] PROBLEM - puppet last run on mw2132 is CRITICAL puppet fail [11:40:55] PROBLEM - puppet last run on mc2010 is CRITICAL Puppet has 1 failures [11:42:55] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 0.132 seconds response time. 
nagiostest.beta.wmflabs.org returns 208.80.155.135 [11:45:03] (03CR) 10KartikMistry: "Should we configure 'Number of instances' in Puppet then? (PS: apertium.org use -j1 and they seems OK with it)." [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/229359 (owner: 10KartikMistry) [11:46:36] PROBLEM - puppet last run on mc2007 is CRITICAL puppet fail [11:49:34] PROBLEM - puppet last run on mw2129 is CRITICAL Puppet has 1 failures [11:55:59] (03CR) 10Alexandros Kosiaris: "For mitigating the problem we have in production? Sure. It won't really solve the problem but it will be helpful as a mitigation" [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/229359 (owner: 10KartikMistry) [11:57:26] RECOVERY - puppet last run on mc2007 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures [11:57:35] RECOVERY - puppet last run on mw2120 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [11:57:35] RECOVERY - puppet last run on eventlog2001 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [11:57:44] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [11:57:45] RECOVERY - puppet last run on wtp2004 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [11:57:45] RECOVERY - puppet last run on mw2146 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [11:58:15] RECOVERY - puppet last run on mw2129 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [11:58:34] RECOVERY - puppet last run on graphite2001 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures [11:59:45] RECOVERY - puppet last run on lvs2006 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [11:59:45] RECOVERY - puppet last run on mw2136 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [11:59:45] RECOVERY - puppet 
last run on db2016 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [12:00:05] RECOVERY - puppet last run on mw2088 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:00:15] RECOVERY - puppet last run on mw2095 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:00:15] RECOVERY - puppet last run on mc2008 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [12:00:44] RECOVERY - puppet last run on wtp2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [12:00:44] RECOVERY - puppet last run on mw2132 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [12:01:34] RECOVERY - puppet last run on mw2125 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [12:01:34] RECOVERY - puppet last run on mw2138 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:02:29] RECOVERY - puppet last run on mw2122 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [12:02:29] RECOVERY - puppet last run on ganeti2006 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [12:02:45] RECOVERY - puppet last run on cp2009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [12:02:55] RECOVERY - puppet last run on rdb2004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [12:04:25] RECOVERY - puppet last run on wtp2006 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:04:26] RECOVERY - puppet last run on mw2108 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [12:04:35] RECOVERY - puppet last run on db2019 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [12:05:04] RECOVERY - puppet last run on db2030 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [12:05:05] RECOVERY - puppet last run on db2028 is OK Puppet is 
currently enabled, last run 1 minute ago with 0 failures [12:05:05] RECOVERY - puppet last run on mc2010 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [12:05:05] RECOVERY - puppet last run on mw2121 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [12:05:54] RECOVERY - puppet last run on mw2124 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [12:11:15] (03PS1) 10Muehlenhoff: Add ferm rules for jmxtrans on Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/229371 [12:14:41] (03PS1) 10Muehlenhoff: Enable base::firewall on analytics1021 [puppet] - 10https://gerrit.wikimedia.org/r/229374 [12:15:50] (03PS1) 10Muehlenhoff: Enable base::firewall on the remaining kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/229375 [12:32:43] !log upgrading asw-c-codfw and asw-d-codfw to newer junos [12:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:33:15] jynus: ^^^ [12:34:15] lol [12:34:24] an alter table takes 4 hours [12:34:51] I started mine a lot of hours ago, before your first warnings [12:35:23] go ahead, is not affecting me [12:41:19] !log restarted HHVM on canary appservers for tidy/pcre security updates, remaining app servers following soon [12:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:41:39] (03CR) 10Mark Bergsma: [C: 04-2] "A new deployment system is under development by the Release Engineering team. 
It's already problematic that RESTbase would be using a one-" [puppet] - 10https://gerrit.wikimedia.org/r/229306 (https://phabricator.wikimedia.org/T107532) (owner: 10GWicke) [13:02:18] (03CR) 10Thiemo Mättig (WMDE): [C: 031] Exclude Flow topic boards and Draft NS from Special:UnconnectedPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229197 (https://phabricator.wikimedia.org/T107927) (owner: 10Aude) [13:04:27] (03CR) 10Thiemo Mättig (WMDE): [C: 031] Add config for Wikisource badges on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229062 (https://phabricator.wikimedia.org/T97014) (owner: 10Aude) [13:05:20] 6operations, 7HHVM, 7Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#1510569 (10Reedy) There's still a handful of production used machines still running PHP 5.3 If they get (all) upgraded to trusty, but not switched to hhvm, the... [13:05:25] (03CR) 10Thiemo Mättig (WMDE): [C: 031] Update Wikibase site id and group for test2wiki and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208655 (https://phabricator.wikimedia.org/T94416) (owner: 10Aude) [13:15:25] PROBLEM - Host mw2171 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:25] PROBLEM - Host cp2020 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:25] PROBLEM - Host mw2170 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:25] PROBLEM - Host mw2172 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:25] PROBLEM - Host mw2176 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:35] PROBLEM - Host mw2202 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:35] PROBLEM - Host mw2154 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:35] PROBLEM - Host mw2164 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:35] PROBLEM - Host mw2175 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:35] PROBLEM - Host mw2150 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:35] PROBLEM - Host mw2169 is DOWN: 
PING CRITICAL - Packet loss = 100% [13:15:44] PROBLEM - Host mw2151 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:45] PROBLEM - Host mw2205 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:45] PROBLEM - Host db2034 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:45] PROBLEM - Host mw2211 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:45] PROBLEM - Host mw2156 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:45] PROBLEM - Host wtp2012 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:45] PROBLEM - Host cp2021 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:46] PROBLEM - Host cp2014 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:46] PROBLEM - Host mw2160 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:47] PROBLEM - Host cp2015 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:47] PROBLEM - Host mw2149 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:48] PROBLEM - Host mw2184 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:48] PROBLEM - Host mw2214 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:49] PROBLEM - Host mw2158 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:55] DON'T WORRY [13:15:59] switches rebooting. 
[13:16:01] PROBLEM - Host cp2024 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:01] PROBLEM - Host mw2155 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:02] PROBLEM - Host ms-be2010 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:02] PROBLEM - Host mw2209 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:03] PROBLEM - Host mw2189 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:03] PROBLEM - Host mw2163 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:04] PROBLEM - Host mw2181 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:04] PROBLEM - Host mw2201 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:05] PROBLEM - Host mw2187 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:05] PROBLEM - Host mw2178 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:06] PROBLEM - Host mw2197 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:06] PROBLEM - Host mw2191 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:07] PROBLEM - Host cp2017 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:07] PROBLEM - Host mw2207 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:08] PROBLEM - Host mw2194 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:08] :-) [13:16:14] PROBLEM - Host mc2013 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:14] PROBLEM - Host mw2182 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:14] PROBLEM - Host wtp2015 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:14] PROBLEM - Host mw2196 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:14] PROBLEM - Host mw2203 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:15] !log Removed Wikidata JSON dumps from Monday and Tuesday as they were incomplete/ had the wrong serialization format [13:16:15] PROBLEM - Host wtp2018 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:15] PROBLEM - Host mw2165 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:16] PROBLEM - Host mw2188 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:16] PROBLEM - Host mw2180 is DOWN: PING CRITICAL - Packet loss = 100% 
[13:16:17] PROBLEM - Host mw2168 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:23:09] 6operations, 10Traffic, 10Wikimedia-DNS: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1510590 (10Jgreen) In the parallel case of the mailhouse Silverpop for fundraising donor appeal mail, we have subdomain-specific SPF records and as I k... [13:23:14] http://thewpvaletcom.c.presscdn.com/wp-content/uploads/2014/02/website-down-dont-panic.jpg [13:24:56] (03CR) 10Ottomata: Add ferm rules for jmxtrans on Kafka brokers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/229371 (owner: 10Muehlenhoff) [13:26:34] and finally a slave on codfw fails after retrying 86400 times! :-) [13:27:08] 6operations, 10Traffic, 10Wikimedia-DNS: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1510597 (10BBlack) The problem with that approach is that eventdonations is a CNAME record: ``` eventdonations 1H IN CNAME contrib-wi-10109-10472... [13:27:20] (03CR) 10Ottomata: [C: 031] Enable base::firewall on analytics1021 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/229374 (owner: 10Muehlenhoff) [13:27:33] moritzm: joal, shall we do analytics1027? [13:29:15] ottomata: sounds good to me, whenever joal is also ready [13:29:42] (03PS2) 10Muehlenhoff: Add ferm rules for jmxtrans on Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/229371 [13:32:09] 6operations, 10Traffic, 10Wikimedia-DNS: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1510611 (10Jgreen) >>! In T107940#1510597, @BBlack wrote: > The problem with that approach is that eventdonati...
[13:33:09] (03CR) 10BBlack: "Also, the variant in the previous patch for the portals doesn't seem to work. I get a page that just says "Invalid parameters..." when I " [puppet] - 10https://gerrit.wikimedia.org/r/229219 (owner: 10GWicke) [13:35:25] PROBLEM - puppet last run on db2045 is CRITICAL Puppet has 1 failures [13:35:57] 6operations, 10Traffic, 10Wikimedia-DNS: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1510628 (10Jgreen) >>! In T107940#1510611, @Jgreen wrote: >>>! In T107940#1510597, @BBlack wrote: >> The problem with that approach is that eventdonati... [13:36:14] PROBLEM - puppet last run on mw2163 is CRITICAL Puppet has 1 failures [13:36:24] PROBLEM - puppet last run on mw2188 is CRITICAL Puppet has 1 failures [13:36:45] PROBLEM - puppet last run on mw2176 is CRITICAL puppet fail [13:37:59] (03CR) 10Giuseppe Lavagetto: [C: 032] "Makes sense" [puppet] - 10https://gerrit.wikimedia.org/r/229194 (owner: 10Smalyshev) [13:38:08] ottomata, moritzm : Ready ! [13:38:16] ha, moritzm, gotta be nitpicky! your other jmx port ferm rule in that same class uses a hyphen, and this one uses an underscore! :) [13:38:47] the jmxtrans-jmx one [13:38:48] :) [13:38:55] ah ok, cool, joal is here, so lets proceed moritzm :) [13:39:06] sorry for the delay folks :) [13:39:10] paravoid: Ok to scap? Yesterday's train deploy didn't properly scap localization :/ [13:39:25] ottomata: I'll fix that up once we've flipped analytics1027 :-) [13:39:29] merging now [13:39:34] (03PS2) 10Muehlenhoff: All running services are now ferm-enabled, so turn enable base::firewall on analytics1027. 
[puppet] - 10https://gerrit.wikimedia.org/r/229147 (https://phabricator.wikimedia.org/T83597) [13:39:49] (03PS5) 10Giuseppe Lavagetto: T105080: add maintenance mode configs for nginx [puppet] - 10https://gerrit.wikimedia.org/r/228140 (owner: 10Smalyshev) [13:40:04] (03CR) 10Muehlenhoff: [C: 032 V: 032] All running services are now ferm-enabled, so turn enable base::firewall on analytics1027. [puppet] - 10https://gerrit.wikimedia.org/r/229147 (https://phabricator.wikimedia.org/T83597) (owner: 10Muehlenhoff) [13:40:42] merged, forcing a puppet run on analytics1027 [13:41:11] 6operations, 7discovery-system: Remove etcd1001,2 from the etcd cluster, decommission them. - https://phabricator.wikimedia.org/T108010#1510640 (10Joe) [13:42:39] joal, ottomata: puppet run completed [13:43:13] moritzm, ottomata : so far so good [13:43:17] 6operations, 6Discovery, 3Discovery-Cirrus-Sprint, 7Elasticsearch: Use fixed ports for elasticsearch - https://phabricator.wikimedia.org/T107278#1510642 (10chasemp) Technically not all elastic boxes have this applied yet. The restart process is drawn out, but I will reopen here if there are issues as I kn... [13:43:38] Will keep monitoring and let you know (next big run launches in about 20 minutes) [13:43:45] joal: ok! 
[13:44:25] PROBLEM - Restbase endpoints health on xenon is CRITICAL: /page/html/{title} is CRITICAL: Test Get html by title from Parsoid returned the unexpected status 500 (expecting: 200): /page/html/{title} is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /page/data-parsoid/{title} is CRITICAL: Test Get data-parsoid by title returned the unexpected status 500 (expecting: 200): /page/revision [13:44:35] RECOVERY - puppet last run on db2059 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [13:46:05] RECOVERY - puppet last run on db2045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:46:13] (03PS6) 10Giuseppe Lavagetto: T105080: add maintenance mode configs for nginx [puppet] - 10https://gerrit.wikimedia.org/r/228140 (owner: 10Smalyshev) [13:46:15] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: /page/html/{title} is CRITICAL: Test Get html by title from Parsoid returned the unexpected status 500 (expecting: 200): /page/html/{title} is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /page/data-parsoid/{title} is CRITICAL: Test Get data-parsoid by title returned the unexpected status 500 (expecting: 200): /page/r [13:46:34] PROBLEM - Cassanda CQL query interface on praseodymium is CRITICAL: Connection refused [13:46:35] PROBLEM - Restbase endpoints health on cerium is CRITICAL: /page/html/{title} is CRITICAL: Test Get html by title from Parsoid returned the unexpected status 500 (expecting: 200): /page/html/{title} is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /page/data-parsoid/{title} is CRITICAL: Test Get data-parsoid by title returned the unexpected status 500 (expecting: 200): /page/revisio [13:46:35] PROBLEM - Cassandra database on praseodymium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, 
args CassandraDaemon [13:46:54] RECOVERY - puppet last run on cp2014 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [13:46:54] RECOVERY - puppet last run on mw2212 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [13:46:54] RECOVERY - puppet last run on mw2163 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [13:46:55] RECOVERY - puppet last run on mw2188 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:47:05] RECOVERY - puppet last run on db2054 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:47:11] (03CR) 10Giuseppe Lavagetto: [C: 032] "This is ok only temporarily while we prepare you an efficient way to pool/depool hosts, which will arrive pretty soon." [puppet] - 10https://gerrit.wikimedia.org/r/228140 (owner: 10Smalyshev) [13:47:25] RECOVERY - puppet last run on db2065 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [13:47:25] RECOVERY - puppet last run on cp2016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:47:25] RECOVERY - puppet last run on db2047 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures [13:47:26] RECOVERY - puppet last run on mw2176 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [13:48:15] RECOVERY - puppet last run on wtp2016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:48:54] joal, ottomata: I've enabled some logging for dropped packets and so far nothing appeared [13:48:54] RECOVERY - puppet last run on mw2203 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [13:49:05] RECOVERY - puppet last run on db2038 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:49:05] RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:49:05] RECOVERY - puppet last run on 
mw2182 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:49:14] RECOVERY - puppet last run on cp2022 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures [13:49:32] 6operations, 7Monitoring: collect per-service cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027#1510650 (10fgiunchedi) 3NEW a:3fgiunchedi [13:49:35] RECOVERY - puppet last run on db2057 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:49:36] 6operations, 10Traffic, 10Wikimedia-DNS: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1510658 (10BBlack) >>! In T107940#1510628, @Jgreen wrote: > For silverpop email.donate.wikimedia.org has an A record pointing to 208.80.154.224 (text-l... [13:49:44] RECOVERY - puppet last run on mc2014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:49:44] RECOVERY - puppet last run on ms-be2010 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [13:49:48] (03PS5) 10Filippo Giunchedi: diamond: add upstart/systemd service stats [puppet] - 10https://gerrit.wikimedia.org/r/224093 (https://phabricator.wikimedia.org/T108027) [13:49:54] moritzm: oozie has not yet launched a job on the cluster --> when this happens without failure, I'll stop bothering :-P [13:50:01] alerts by lag on codfw are expected [13:50:13] (03PS5) 10Filippo Giunchedi: diamond: service stats puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/224094 (https://phabricator.wikimedia.org/T108027) [13:50:16] (I logged them some time ago) [13:50:24] RECOVERY - puppet last run on db2061 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [13:50:24] RECOVERY - puppet last run on mw2168 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:50:25] RECOVERY - puppet last run on mw2172 is OK Puppet is currently enabled, last run 25 seconds 
ago with 0 failures [13:50:26] will ack them [13:50:52] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Fix rules.log error when starting Blazegraph (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/229150 (owner: 10Smalyshev) [13:50:55] RECOVERY - puppet last run on mw2174 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:50:55] RECOVERY - puppet last run on db2037 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:50:55] RECOVERY - puppet last run on mw2150 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [13:51:14] RECOVERY - puppet last run on db2041 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:51:15] RECOVERY - puppet last run on mw2152 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:51:15] RECOVERY - puppet last run on cp2019 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:51:24] RECOVERY - puppet last run on cp2017 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [13:51:45] RECOVERY - puppet last run on cp2015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:51:45] RECOVERY - puppet last run on mw2200 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:51:45] RECOVERY - puppet last run on mw2202 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [13:53:55] RECOVERY - puppet last run on mw2167 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures [13:53:55] RECOVERY - puppet last run on db2050 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:53:55] RECOVERY - puppet last run on mw2193 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:53:56] RECOVERY - puppet last run on mw2186 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:54:45] RECOVERY - puppet last run on mw2209 is OK Puppet is 
currently enabled, last run 1 minute ago with 0 failures [13:58:25] (03PS3) 10Muehlenhoff: Add ferm rules for jmxtrans on Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/229371 [13:59:28] oh hm, moritzm, this is not about this patch, but you can for sure make the kafka-jmx rule also just $INTERNAL, [13:59:43] if I tunnel through iron, will it look internal? or all? [14:03:21] ottomata: it should look internal, I'll update the kafka-jmx rule [14:04:04] k [14:04:32] moritzm: likely it will be fine to do internal for kafka too, but since we are going to further restrict that to specific hosts later anyway, that is up to you [14:04:39] moritzm: oozie jobs launched, no issue on my side [14:05:18] ottomata: let's skip that for now since we'll be locking it down even further soon anyway [14:05:19] Thanks for the upgrade moritzm :) [14:05:23] joal: great, thanks! [14:06:46] (03PS2) 10Muehlenhoff: Enable base::firewall on analytics1021 [puppet] - 10https://gerrit.wikimedia.org/r/229374 [14:07:00] 6operations, 10RESTBase-Cassandra: upgrade RESTBase cluster to Cassandra 2.1.8 - https://phabricator.wikimedia.org/T107949#1510679 (10Eevans) >>! In T107949#1510274, @fgiunchedi wrote: > proposed plan: > * upgrade cassandra to 2.1.8 via deb upgrades on the staging cluster > * benchmark/stresstest > * upload pa... [14:17:56] there's an unusual drop in pageview/min at around 14:03 : https://gdash.wikimedia.org/dashboards/reqsum/ [14:18:00] ? 
[14:18:39] or you could interpret that as a strange little spike/plateau from ~13:00 to 14:03 [14:19:26] on mobile it's more pronounced at looks more like an increase event from 13:00->14:00-ish [14:19:49] (03CR) 10Ottomata: [C: 031] Enable base::firewall on analytics1021 [puppet] - 10https://gerrit.wikimedia.org/r/229374 (owner: 10Muehlenhoff) [14:20:41] RECOVERY - Cassandra database on praseodymium is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [14:21:42] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy [14:22:00] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy [14:22:01] RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy [14:22:31] RECOVERY - Cassanda CQL query interface on praseodymium is OK: TCP OK - 0.010 second response time on port 9042 [14:26:52] (03PS4) 10Dzahn: Add ferm rules for jmxtrans on Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/229371 (owner: 10Muehlenhoff) [14:27:32] (03PS1) 10Jcrespo: Repool db1059 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229384 [14:28:07] (03CR) 10Jcrespo: [C: 032] Repool db1059 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229384 (owner: 10Jcrespo) [14:29:21] !log jynus Synchronized wmf-config/db-eqiad.php: repool db1059 (duration: 00m 13s) [14:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:29:56] (03CR) 10Dzahn: [C: 032] "ack, 2101 = jmxtrans and checked on analytics1021" [puppet] - 10https://gerrit.wikimedia.org/r/229371 (owner: 10Muehlenhoff) [14:32:08] (03CR) 10Dzahn: [C: 031] "we activated it yesterday on mc1009 and it's been fine, nothing got dropped here" [puppet] - 10https://gerrit.wikimedia.org/r/227418 (owner: 10Muehlenhoff) [14:33:02] (03CR) 10Dzahn: "and number of connections on mc1009 has been between 600 and 1000 (where 64k would be our limit for 
connection tracking)" [puppet] - 10https://gerrit.wikimedia.org/r/227418 (owner: 10Muehlenhoff) [14:36:49] (03PS2) 10Muehlenhoff: Enable ferm for remaining mc1* systems [puppet] - 10https://gerrit.wikimedia.org/r/227418 [14:36:57] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm for remaining mc1* systems [puppet] - 10https://gerrit.wikimedia.org/r/227418 (owner: 10Muehlenhoff) [14:37:00] (03CR) 10GWicke: "Mark, we have been working with release engineering for 4+ months now. Marko has attended each of the deployment cabal meetings, we have d" [puppet] - 10https://gerrit.wikimedia.org/r/229306 (https://phabricator.wikimedia.org/T107532) (owner: 10GWicke) [14:39:42] (03PS1) 10Jcrespo: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229386 [14:40:35] (03PS1) 10Filippo Giunchedi: cassandra: remove obsolete diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/229387 (https://phabricator.wikimedia.org/T78514) [14:41:28] (03CR) 10Jcrespo: [C: 032] Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229386 (owner: 10Jcrespo) [14:42:36] !log jynus Synchronized wmf-config/db-eqiad.php: depool db1056 (duration: 00m 12s) [14:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:44:08] !log restarted HHVM on appservers (mw1026-mw1113) for tidy/pcre security updates [14:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:47:30] (03PS1) 10Dzahn: bacula: remove fileset for blog [puppet] - 10https://gerrit.wikimedia.org/r/229388 [14:48:11] PROBLEM - puppet last run on cp2023 is CRITICAL puppet fail [14:48:49] 6operations, 7Database: duplicate key error on db1056 - https://phabricator.wikimedia.org/T108033#1510743 (10jcrespo) 3NEW a:3jcrespo [14:48:55] ottomata: is this still a thing since blog moved out of our infrastructure? 
include statistics::cron_blog_pageviews [14:49:35] mutante: no idea :/ [14:49:37] (03CR) 10Muehlenhoff: "We'll withhold merging this until the current Kafka migration is completed." [puppet] - 10https://gerrit.wikimedia.org/r/229374 (owner: 10Muehlenhoff) [14:50:30] (03Abandoned) 10Muehlenhoff: Enable base::firewall on the remaining kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/229375 (owner: 10Muehlenhoff) [14:51:08] ottomata: ok, will ask tbayer (he appears as a $recipient_email in there) [14:51:17] !log stopped restbase on restbase1009 [14:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:51:59] aye, thanks [14:53:06] (03CR) 10Alexandros Kosiaris: [C: 032] bacula: remove fileset for blog [puppet] - 10https://gerrit.wikimedia.org/r/229388 (owner: 10Dzahn) [14:53:08] (03PS1) 10Dzahn: statistics: remove blog pageviews script [puppet] - 10https://gerrit.wikimedia.org/r/229390 [14:53:12] (03PS2) 10Alexandros Kosiaris: bacula: remove fileset for blog [puppet] - 10https://gerrit.wikimedia.org/r/229388 (owner: 10Dzahn) [14:53:16] (03CR) 10Alexandros Kosiaris: [V: 032] bacula: remove fileset for blog [puppet] - 10https://gerrit.wikimedia.org/r/229388 (owner: 10Dzahn) [14:54:11] PROBLEM - Restbase root url on restbase1009 is CRITICAL - Socket timeout after 10 seconds [14:55:36] (03CR) 10Dzahn: "should have a +1 from Tilman" [puppet] - 10https://gerrit.wikimedia.org/r/229390 (owner: 10Dzahn) [14:59:08] (03CR) 10Jcrespo: [C: 031] "I'm ok, labs people should be pinged just before deployment." [puppet] - 10https://gerrit.wikimedia.org/r/228228 (https://phabricator.wikimedia.org/T104699) (owner: 10Muehlenhoff) [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150805T1500). [15:00:04] kart_: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. 
Please be available during the process. [15:00:15] jouncebot: yes sir. [15:00:30] kart_: I can SWAT this morning [15:00:46] That patch is for wmf17. Does it need submodule update? [15:00:49] thcipriani: ^^ [15:01:17] * thcipriani looks at .gitmodules [15:01:51] 6operations, 6Discovery, 10Traffic, 10Wikidata, and 2 others: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1510797 (10Joe) I've took a bit of an alternative approach: - deploy behind misc-web, as query.wikidata.org - as logstash does, do not use lvs b... [15:01:57] 6operations, 10vm-requests: request VM for grafana - https://phabricator.wikimedia.org/T107832#1510798 (10Dzahn) @Faidon ok if we go with a jessie VM then per above ^ ? [15:02:01] !log rebooting labvirt1009 [15:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:02:18] kart_: looks like it should bump submodule automagically [15:02:52] ok. So, just merge. [15:03:09] I mean: I will just merge it. [15:04:03] thcipriani: Is that fine? [15:04:13] kart_: sure, for for it if you have +2 there [15:04:23] Yes. it is CX :) [15:04:28] s/for for/go for/ [15:07:16] 6operations, 10vm-requests: request VM for grafana - https://phabricator.wikimedia.org/T107832#1510836 (10faidon) A VM just for Grafana sounds a bit excessive to me. I could see the point for a more generic webserver, potentially hosting the other Graphite frontends too, as well as other one-off simple web ser... [15:07:40] thcipriani: thanks. we're good. [15:07:57] thcipriani: you can update submodule. [15:08:10] (oh. let it merge) :) [15:08:14] kart_: yup, pulling down changes on tin now [15:10:52] <_joe_> grrrit-wm: why are you silent? 
[15:11:41] !log thcipriani Synchronized php-1.26wmf17/extensions/ContentTranslation/modules/tools/ext.cx.tools.mt.js: SWAT: FIX: Not able to set cursor in previous sections [[gerrit:229328]] (duration: 00m 12s) [15:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:53] ^ kart_ check please [15:12:25] PROBLEM - puppet last run on restbase1003 is CRITICAL Puppet has 1 failures [15:12:36] thcipriani: okay! [15:13:54] RECOVERY - puppet last run on cp2023 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:22] thcipriani: cool. Thanks! Working as expected. [15:14:32] kart_: awesome, thanks! [15:15:02] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 3Discovery-Wikidata-Query-Service-Sprint: Assign an LVS service to the wikidata query service - https://phabricator.wikimedia.org/T107601#1510856 (10Joe) I currently decided not to use an LVS loadbalancer for this, since it's being put behind... [15:15:41] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 3Discovery-Wikidata-Query-Service-Sprint: Assign an LVS service to the wikidata query service - https://phabricator.wikimedia.org/T107601#1510857 (10Joe) 5Open>3declined [15:19:35] RECOVERY - Restbase root url on restbase1009 is OK: HTTP OK: HTTP/1.1 200 - 15145 bytes in 0.008 second response time [15:21:06] RECOVERY - puppet last run on restbase1003 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:21:32] 6operations, 6WMF-Legal: Set up new URL policy.wikimedia.org - https://phabricator.wikimedia.org/T97329#1510880 (10Dzahn) [15:25:14] 6operations, 6WMF-Legal: Set up new URL policy.wikimedia.org - https://phabricator.wikimedia.org/T97329#1510886 (10Dzahn) Any updates here? 
There are some concerns about WMF building microsites outside of the regular wikis at https://meta.wikimedia.org/wiki/Foundation_wiki_feedback#Secessionism [15:30:17] 6operations, 10vm-requests: request VM for grafana - https://phabricator.wikimedia.org/T107832#1510902 (10fgiunchedi) +1 to a generic webserver, I don't feel that strongly about not colocating graphite and grafana ATM so that's fine by me as well [15:35:42] PROBLEM - puppet last run on iridium is CRITICAL puppet fail [15:43:48] 6operations, 10ops-eqiad: db1059 raid degraded - https://phabricator.wikimedia.org/T107024#1510919 (10Cmjohnson) Disk replaced and is rebuilding. Return Shipment Information USPS 9202 3946 5301 2428 1810 99 FEDEX 9611918 2393026 49672779 [15:50:34] 6operations, 10ops-eqiad, 10Traffic, 5Patch-For-Review: eqiad: investigate thermal issues with some cp10xx machines - https://phabricator.wikimedia.org/T103226#1510943 (10BBlack) So the affected machines by-cluster: - mobile - cp1046 - cp1059 - cp1060 - upload - cp1061 - cp1062 - cp1064 - text - cp... [15:53:45] 6operations, 10ops-eqiad, 10Traffic, 5Patch-For-Review: eqiad: investigate thermal issues with some cp10xx machines - https://phabricator.wikimedia.org/T103226#1510946 (10BBlack) (lists above edited, I had mistakenly used 106[78] when it should have been 106[67] in the text cluster...) [15:55:03] RECOVERY - puppet last run on iridium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:57:40] 6operations, 10vm-requests: request VM for grafana - https://phabricator.wikimedia.org/T107832#1510951 (10Dzahn) We have a generic webserver VM already, one is for "misc static HTML services" (bromine) and the other "misc PHP apps" (krypton). We could use krypton then. I just assumed we want grafana more separ... 
[15:58:05] 6operations, 10vm-requests: request VM for grafana - https://phabricator.wikimedia.org/T107832#1510953 (10Dzahn) a:3Dzahn [15:59:14] 6operations, 10Traffic: Stop using LVS from varnishes - https://phabricator.wikimedia.org/T107956#1510955 (10GWicke) @bblack, a //simple// implementation of a rolling deploy across hundreds of servers using a sliding window of x% of servers involves many individual depools / re-pools, one of each per server. E... [16:00:29] apergos: there’s some traffic on labs-l talking about dumps (and about your work on dumps) — can you respond? [16:00:39] yep [16:00:47] already have it in my queue [16:00:53] andrewbogott: [16:01:04] thanks [16:01:25] 6operations, 10ops-eqiad: db1059 raid degraded - https://phabricator.wikimedia.org/T107024#1510961 (10jcrespo) Hey, @Cmjohnson. Thank you very much! And double thank you for the heads-up, sometimes, when rebuilding the disk, we observe a small performance degradation, and knowing this is very helpful. [16:06:23] RECOVERY - DPKG on labvirt1005 is OK: All packages OK [16:07:23] 6operations, 10ops-eqiad, 10Traffic, 5Patch-For-Review: eqiad: investigate thermal issues with some cp10xx machines - https://phabricator.wikimedia.org/T103226#1511000 (10BBlack) Confirmed: throttle/temp data still looks like it did before, other than cp1065, which still looks like it's fine after the ther... [16:12:41] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) - https://phabricator.wikimedia.org/T105756#1511031 (10RobH) [16:13:30] Is it ok to run a scap to fix yesterday's train? [16:14:03] twentyafterfour: ^ [16:18:06] grrrit-wm: ?? [16:19:21] greg-g: I also need another set of backports today... the json dump script is still broken and we have another maintenance script which is broken [16:19:44] I hope to find someone to review :/ [16:20:24] ori: 'beresp.do_esi': cannot be set in method 'vcl_recv' [16:20:36] is grrrit-wm misbehaving? 
[16:22:32] apparently [16:22:38] yeah [16:22:56] there we go [16:23:21] hmm [16:23:24] that didn't work [16:23:54] PROBLEM - puppet last run on cp1065 is CRITICAL Puppet has 1 failures [16:24:51] i was just popping in here to see about grrrit-wm [16:25:08] is that hashar's baby? i forget who's the keeper of that flame. [16:25:26] I'm fiddling with it now [16:25:29] (03PS2) 10ArielGlenn: dumps: correct number of workers to run at once [puppet] - 10https://gerrit.wikimedia.org/r/229410 [16:25:31] (03PS2) 10Alex Monk: maintain-replicas: Do not record centralauth in meta_p.wiki [software] - 10https://gerrit.wikimedia.org/r/221042 (https://phabricator.wikimedia.org/T101750) [16:25:37] cscott, fixed [16:25:57] cscott: Yuvi wrote it, but he's not the only person able to poke it, see ^ [16:26:06] cscott: not hashar, no. not sure who's baby it is [16:26:12] RECOVERY - puppet last run on cp1065 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:17] grrrit-wm: !lastmerge operations/puppet [16:26:22] ^ wishlist :) [16:26:24] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) - https://phabricator.wikimedia.org/T105756#1511084 (10RobH) Meeting notes from today: https://etherpad.wikimedia.org/p/mailman-Aug-2015 [16:26:28] bblack: nice [16:26:39] bblack: To just report HEAD? [16:26:50] I wonder, does the phabot do anything cool like that? [16:26:59] yes, because we need more convoluted IRC interfaces to stuff we could totally check in another window :) [16:27:13] but.... then you have to leave IRC! 
[16:27:22] That ^ [16:27:28] :) [16:27:29] (03CR) 10ArielGlenn: [C: 032] dumps: correct number of workers to run at once [puppet] - 10https://gerrit.wikimedia.org/r/229410 (owner: 10ArielGlenn) [16:28:21] !log cache puppets disabled for a little while, to make sure do_esi doesn't melt things [16:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:30:13] (03CR) 10BBlack: [C: 032] "It's been a few days, nobody's screaming yet, and this is easy to revert" [dns] - 10https://gerrit.wikimedia.org/r/228029 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack) [16:33:31] the us military seems to love being our edge-case :P [16:33:56] dns22.dfas.mil - first host I see still accessing load.php over the decommed bits IPs that aren't legally referenced anywhere :P [16:35:11] (03CR) 10Mobrovac: [C: 031] update cassandra-metrics-collector to latest [puppet] - 10https://gerrit.wikimedia.org/r/229401 (https://phabricator.wikimedia.org/T97024) (owner: 10Eevans) [16:35:13] (apparently some things on the internet also think a TTL of "10 minutes" means "cache this days", but not very many) [16:35:45] "cache this for days"? [16:35:46] anyways [16:37:23] mutante: I'm not creating any of those tickets unless you ask me ;D [16:37:29] the ones on the etherpad notes [16:37:49] (03CR) 10GWicke: [C: 031] update cassandra-metrics-collector to latest [puppet] - 10https://gerrit.wikimedia.org/r/229401 (https://phabricator.wikimedia.org/T97024) (owner: 10Eevans) [16:43:08] !log hoo Started scap: Rebuild l10n cache for wmf17, got forgotten during the train [16:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:43:45] robh: stubborn :p [16:44:03] nah, i just dont wanna duplicate what he is doing. [16:44:12] i just suck at sounding supportive. 
[16:44:18] ;] [16:46:36] (03Abandoned) 10GWicke: Bump to 6ac383c [dumps/html/deploy] - 10https://gerrit.wikimedia.org/r/206612 (owner: 10GWicke) [16:47:26] (03CR) 10GWicke: [C: 031] htmldumper 0.1.0 with dependencies [dumps/html/deploy] - 10https://gerrit.wikimedia.org/r/204964 (https://phabricator.wikimedia.org/T94457) (owner: 10GWicke) [16:47:52] (03CR) 10GWicke: "@ArielGlenn, any news on this?" [dumps/html/deploy] - 10https://gerrit.wikimedia.org/r/204964 (https://phabricator.wikimedia.org/T94457) (owner: 10GWicke) [16:51:43] mark / paravoid: https://phabricator.wikimedia.org/T108057 [16:51:54] mutante: ^^ too :) [16:52:34] (03PS4) 10BBlack: Remove cache::bits role from bits-cluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/228033 (https://phabricator.wikimedia.org/T95448) [16:52:51] (03PS5) 10BBlack: Remove cache::bits role from bits-cluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/228033 (https://phabricator.wikimedia.org/T95448) [16:52:59] (03PS3) 10BBlack: Decom bits cluster varnish/lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/228034 (https://phabricator.wikimedia.org/T95448) [16:54:20] (03PS2) 10Mobrovac: Mathoid: Enable advanced enpoint monitoring [puppet] - 10https://gerrit.wikimedia.org/r/229398 (https://phabricator.wikimedia.org/T105775) [16:54:22] JohnFLewis: ok, great, let me add you to that special group to be able to read the NDA, the setup is a bit odd [16:55:01] mutante: okay, poke me when done :) [16:56:22] (03PS2) 10Chad: beta: Swap caches to deployment-cache-*04, which is jessie [puppet] - 10https://gerrit.wikimedia.org/r/227744 (https://phabricator.wikimedia.org/T98758) [16:56:25] (03PS2) 10Chad: beta: Swap caches to deployment-cache-*04, which is jessie [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227743 (https://phabricator.wikimedia.org/T98758) [16:57:27] bblack: When you get a chance, would be nice to get in ^ [16:57:31] JohnFLewis: try now to access "L2" [16:58:55] mutante: yep signed [16:59:13] 
JohnFLewis: great, thanks [16:59:14] That was much easier than the old one :p [17:00:06] JohnFLewis: just a little, haha [17:02:18] !log depooled cp1046, cp1061, cp1066 ( thermal batch 1: https://phabricator.wikimedia.org/T103226 ) [17:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:06:30] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) - https://phabricator.wikimedia.org/T105756#1511348 (10Dzahn) my notes on tasks: - create VMs (2, one staging one prod, but we can use the same hostname and reinstall) - export configs and archives for a few big lists from sodium,... [17:07:54] !log really depooled cp1046, cp1061, cp1066 ( thermal batch 1: https://phabricator.wikimedia.org/T103226 ) [17:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:08:41] 6operations, 6Labs, 3Labs-Sprint-107, 3Labs-Sprint-108, 3ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1511353 (10Andrew) labvirt1009 is now running 3.16.0-45-generic. A few tentative suspend/resumes suggest that all is well. If labvirt1009 i... [17:09:11] !log hoo Finished scap: Rebuild l10n cache for wmf17, got forgotten during the train (duration: 26m 02s) [17:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:09:20] Looks good :) [17:10:26] mutante: should i go through that process too ? [17:10:53] PROBLEM - Host cp1061 is DOWN: PING CRITICAL - Packet loss = 100% [17:10:53] PROBLEM - Host cp1066 is DOWN: PING CRITICAL - Packet loss = 100% [17:10:56] matanya: yes, i think it wouldn't hurt [17:11:20] ok, will do at some point [17:11:26] sorry, I was slow on the icinga downtimes! 
[17:11:31] those host downs expected [17:11:41] matanya: better now than waiting to be asked to do it later like me :) [17:11:50] Beat them to it ;) [17:12:01] yea, don't have tim atm, but maybe tomorrow [17:12:07] *time [17:13:32] PROBLEM - IPsec on cp3013 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1066_v4, cp1066_v6 [17:13:32] PROBLEM - IPsec on cp4009 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1066_v4, cp1066_v6 [17:13:33] PROBLEM - IPsec on cp3014 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1066_v4, cp1066_v6 [17:13:43] PROBLEM - IPsec on cp4010 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1066_v4, cp1066_v6 [17:13:53] PROBLEM - IPsec on cp3015 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1046_v4, cp1046_v6 [17:14:02] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1046_v4, cp1046_v6 [17:14:02] PROBLEM - IPsec on cp4016 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1066_v4, cp1066_v6 [17:14:02] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1046_v4, cp1046_v6 [17:14:03] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1061_v4, cp1061_v6 [17:14:07] heh [17:14:13] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1061_v4, cp1061_v6 [17:14:13] PROBLEM - IPsec on cp4008 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1066_v4, cp1066_v6 [17:14:14] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1061_v4, cp1061_v6 [17:14:14] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1061_v4, cp1061_v6 [17:14:14] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1066_v4, cp1066_v6 [17:14:14] PROBLEM - IPsec on cp3005 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1066_v4, cp1066_v6 [17:14:14] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1061_v4, cp1061_v6 
[17:14:23] that's an interesting case for icinga downtime.... [17:14:23] PROBLEM - IPsec on cp3009 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1066_v4, cp1066_v6 [17:14:24] PROBLEM - IPsec on cp3031 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1066_v4, cp1066_v6 [17:14:33] PROBLEM - IPsec on cp4017 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1066_v4, cp1066_v6 [17:14:34] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1061_v4, cp1061_v6 [17:14:34] PROBLEM - IPsec on cp3017 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1046_v4, cp1046_v6 [17:14:34] PROBLEM - IPsec on cp3006 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1066_v4, cp1066_v6 [17:14:43] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1046_v4, cp1046_v6 [17:14:43] do we go downtime ipsec on the other 60 affected hosts because these 3 are being rebooted? :P [17:14:43] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1061_v4, cp1061_v6 [17:14:43] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1061_v4, cp1061_v6 [17:14:43] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1061_v4, cp1061_v6 [17:14:43] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1061_v4, cp1061_v6 [17:14:52] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1066_v4, cp1066_v6 [17:14:53] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1061_v4, cp1061_v6 [17:14:53] PROBLEM - IPsec on cp3003 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1066_v4, cp1066_v6 [17:14:53] PROBLEM - IPsec on cp3004 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1066_v4, cp1066_v6 [17:14:53] PROBLEM - IPsec on cp3018 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1046_v4, cp1046_v6 [17:14:53] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - 
ok: 40 not-conn: cp1061_v4, cp1061_v6 [17:14:53] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1066_v4, cp1066_v6 [17:14:59] !log restarted HHVM on appservers (mw1149-mw1151, mw1161-1188, mw1209-1220) for tidy/pcre security updates [17:15:02] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1061_v4, cp1061_v6 [17:15:02] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1061_v4, cp1061_v6 [17:15:02] PROBLEM - IPsec on cp3012 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1066_v4, cp1066_v6 [17:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:15:13] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1061_v4, cp1061_v6 [17:15:14] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1066_v4, cp1066_v6 [17:15:23] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1061_v4, cp1061_v6 [17:15:23] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1066_v4, cp1066_v6 [17:15:23] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1066_v4, cp1066_v6 [17:15:24] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1061_v4, cp1061_v6 [17:15:24] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1066_v4, cp1066_v6 [17:15:32] I'm not even sure what the right way would be to structure than in dependency terms [17:15:33] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1061_v4, cp1061_v6 [17:15:33] PROBLEM - IPsec on cp3016 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1046_v4, cp1046_v6 [17:15:42] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1061_v4, cp1061_v6 [17:15:43] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1046_v4, cp1046_v6 
[17:15:52] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1061_v4, cp1061_v6 [17:15:52] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1061_v4, cp1061_v6 [17:15:52] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1061_v4, cp1061_v6 [17:15:58] enjoy the spam for now, I guess [17:17:15] 6operations, 10Wikimedia-Mailing-lists: request: use spamassassin to filter as well - https://phabricator.wikimedia.org/T83030#1511414 (10Dzahn) [17:17:59] (I guess to have any hope of proper dependencies for host-down, we'd have to have separate checks for all the ipsec SAs, which would be a pretty big explosion in checks) [17:18:34] (but even then, I don't know that nagios/icinga has a concept that maps to "this service check on hostX is dependent on the state of hostX and hostY" [17:18:41] ) [17:18:49] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) - https://phabricator.wikimedia.org/T105756#1511423 (10Dzahn) [17:19:28] bblack: it does https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/dependencies.html [17:19:38] but we did not implement it [17:22:14] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) - https://phabricator.wikimedia.org/T105756#1511445 (10Dzahn) [17:22:16] 6operations: Upgrade sodium to jessie - https://phabricator.wikimedia.org/T82698#1511444 (10Dzahn) [17:23:27] 6operations: Upgrade sodium to jessie - https://phabricator.wikimedia.org/T82698#1511451 (10Dzahn) We will setup a new mailman install on jessie on a Ganeti VM instead. After that is done we will shut down sodium. So this will not be an upgrade of sodium itself anymore. re-naming ticket to reflect that. see pro... 
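The cross-host dependency bblack asks about above does exist as a Nagios/Icinga `servicedependency` object (the docs mutante links), though as noted it was not implemented here. A minimal sketch of what it could look like — hypothetical, not deployed; host names are taken from the alerts above, and the `SSH` service on the master host is an assumed stand-in for host reachability, since service dependencies are service-to-service:

```cfg
# Hypothetical sketch only: suppress IPsec notifications on cp3013
# while cp1066 (the far end of the IPsec association) is known down.
define servicedependency {
    host_name                       cp1066   ; master host
    service_description             SSH      ; assumed master-side check
    dependent_host_name             cp3013
    dependent_service_description   IPsec
    notification_failure_criteria   c,u      ; mute on CRITICAL/UNKNOWN
}
```

As bblack notes, one such object would be needed per (cache host, IPsec peer) pair, which is why the check explosion makes this unattractive.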
[17:28:09] (03CR) 10BBlack: [C: 032] beta: Swap caches to deployment-cache-*04, which is jessie [puppet] - 10https://gerrit.wikimedia.org/r/227744 (https://phabricator.wikimedia.org/T98758) (owner: 10Chad) [17:28:47] 6operations: integration.wikimedia.org redirect behavior is incorrect - https://phabricator.wikimedia.org/T84060#1511469 (10JanZerebecki) T83399 and T83381 happened, so it only remains to change funnel to redirect. [17:29:19] ostriches: beta jessie thing merged [17:30:34] 6operations, 10vm-requests: eqiad: 1 VM %request for mailman - https://phabricator.wikimedia.org/T108065#1511472 (10Dzahn) 3NEW [17:30:44] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1511479 (10Dzahn) [17:31:04] (03PS1) 10DCausse: Limit the number of states generated by a wildcard query [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229424 (https://phabricator.wikimedia.org/T102589) [17:31:29] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1511472 (10Dzahn) [17:32:48] 6operations, 10Traffic: Switch port 80 to nginx on primary clusters - https://phabricator.wikimedia.org/T107236#1511497 (10BBlack) Coincident with and related to this: there might be future plans afoot (still ill-defined) to blend the cache clusters in general (as in: text + upload on same machines, without an... [17:33:14] (03CR) 10DCausse: "This setting needs elasticsearch 1.7.1 to be operational." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/229424 (https://phabricator.wikimedia.org/T102589) (owner: 10DCausse) [17:33:50] 6operations: shutdown sodium after mailman has migrated to jessie VM - https://phabricator.wikimedia.org/T82698#1511499 (10Dzahn) [17:33:54] jouncebot: next [17:33:54] In 0 hour(s) and 26 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150805T1800) [17:34:43] 6operations: shutdown sodium after mailman has migrated to jessie VM - https://phabricator.wikimedia.org/T82698#904273 (10Dzahn) [17:34:45] 6operations, 10Wikimedia-Mailing-lists: Upgrade Mailman to version 3 - https://phabricator.wikimedia.org/T52864#1511505 (10Dzahn) [17:35:29] (03CR) 10BBlack: [C: 032] beta: Swap caches to deployment-cache-*04, which is jessie [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227743 (https://phabricator.wikimedia.org/T98758) (owner: 10Chad) [17:35:50] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) - https://phabricator.wikimedia.org/T105756#1511513 (10Dzahn) [17:35:52] 6operations: shutdown sodium after mailman has migrated to jessie VM - https://phabricator.wikimedia.org/T82698#1511514 (10Dzahn) [17:36:38] 6operations, 10Wikimedia-Mailing-lists: mailman: centralize logging or create a mailman admin group - https://phabricator.wikimedia.org/T99734#1511517 (10Dzahn) [17:36:54] !log bblack Synchronized wmf-config/squid-labs.php: (no message) (duration: 00m 12s) [17:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:37:02] RECOVERY - IPsec on cp3016 is OK: Strongswan OK - 16 ESP OK [17:37:12] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - 16 ESP OK [17:37:23] RECOVERY - IPsec on cp3015 is OK: Strongswan OK - 16 ESP OK [17:37:23] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 16 ESP OK [17:37:24] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - 16 ESP OK [17:38:04] RECOVERY - IPsec on cp3017 is OK: Strongswan OK - 16 
ESP OK [17:38:14] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - 16 ESP OK [17:38:23] RECOVERY - IPsec on cp3018 is OK: Strongswan OK - 16 ESP OK [17:38:40] bblack: Thanks! [17:39:10] np [17:39:20] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) - https://phabricator.wikimedia.org/T105756#1511534 (10Dzahn) [17:40:20] getting varnish error page on phab request [17:41:38] not me [17:42:03] PROBLEM - puppet last run on cp3009 is CRITICAL Puppet has 1 failures [17:42:55] ^ that's me, but it's ignorable [17:43:03] i got it once, then hit the "try again" link to resend my POST and it fails [17:43:09] but when i open new pages it is fine again now [17:43:42] it was a 503 for a moment [17:44:12] RECOVERY - puppet last run on cp3009 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:46:32] bblack: Swapped the public IPs to point at the new instances, all seems well! [17:46:50] yay [17:46:59] 6operations: shutdown sodium after mailman has migrated to jessie VM - https://phabricator.wikimedia.org/T82698#1511578 (10Dzahn) [17:47:47] 6operations, 10Beta-Cluster, 10Traffic, 5Patch-For-Review: Upgrade beta-cluster caches to jessie - https://phabricator.wikimedia.org/T98758#1511579 (10demon) [17:47:47] (03PS1) 10JanZerebecki: Change docs and integration.m.o to rewrite [puppet] - 10https://gerrit.wikimedia.org/r/229426 (https://phabricator.wikimedia.org/T84060) [17:48:13] 6operations, 10Beta-Cluster, 10Traffic, 5Patch-For-Review: Upgrade beta-cluster caches to jessie - https://phabricator.wikimedia.org/T98758#1276698 (10demon) All seems working, just need to verify purges, make sure browser tests are still ok, and then decom the old instances. 
[17:48:14] !log es1.7.1: freeze indices (take 2) [17:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:48:20] ostriches: poking at text04 a bit, to see what the remaining puppet issues are [17:48:42] I was only having problems with mobile, but I don't think it was puppet's fault. [17:48:51] Also, still no HTTPS so I didn't even bother applying the tls roles. [17:49:01] (So we weren't plagued with eternal failures) [17:49:13] RECOVERY - Host cp1066 is UP: PING OK - Packet loss = 16%, RTA = 0.63 ms [17:49:14] PROBLEM - Restbase root url on restbase1009 is CRITICAL: Connection refused [17:49:24] RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 32 ESP OK [17:49:33] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 32 ESP OK [17:49:33] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 32 ESP OK [17:49:33] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 32 ESP OK [17:49:43] RECOVERY - IPsec on cp3013 is OK: Strongswan OK - 32 ESP OK [17:49:43] RECOVERY - IPsec on cp4009 is OK: Strongswan OK - 32 ESP OK [17:49:43] RECOVERY - IPsec on cp3014 is OK: Strongswan OK - 32 ESP OK [17:49:54] RECOVERY - IPsec on cp4010 is OK: Strongswan OK - 32 ESP OK [17:49:56] ostriches: well one puppet issue on text04 was something retarded with nginx-vs-varnish package installs, which I thought I had universally fixed.
I manually worked around that (one-shot of stop varnish-frontend and configure nginx-full package) [17:50:02] (03PS3) 10Giuseppe Lavagetto: Mathoid: Enable advanced enpoint monitoring [puppet] - 10https://gerrit.wikimedia.org/r/229398 (https://phabricator.wikimedia.org/T105775) (owner: 10Mobrovac) [17:50:08] (03PS1) 10Yuvipanda: labs: Enable scratch volume for k8s-eval project [puppet] - 10https://gerrit.wikimedia.org/r/229427 [17:50:13] RECOVERY - IPsec on cp4016 is OK: Strongswan OK - 32 ESP OK [17:50:23] RECOVERY - IPsec on cp4008 is OK: Strongswan OK - 32 ESP OK [17:50:24] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 32 ESP OK [17:50:24] RECOVERY - IPsec on cp3005 is OK: Strongswan OK - 32 ESP OK [17:50:27] (03CR) 10Giuseppe Lavagetto: [C: 032] Mathoid: Enable advanced enpoint monitoring [puppet] - 10https://gerrit.wikimedia.org/r/229398 (https://phabricator.wikimedia.org/T105775) (owner: 10Mobrovac) [17:50:30] ottomata: btw, https://gerrit.wikimedia.org/r/#/c/229265/1 [17:50:33] RECOVERY - IPsec on cp3009 is OK: Strongswan OK - 32 ESP OK [17:50:38] bblack: It was the way the old text02 was setup with the tls role to install nginx too and it blew up on ssl certs. [17:50:42] RECOVERY - IPsec on cp3031 is OK: Strongswan OK - 32 ESP OK [17:50:43] So I didn't bother applying that role. [17:50:43] RECOVERY - IPsec on cp4017 is OK: Strongswan OK - 32 ESP OK [17:50:52] RECOVERY - IPsec on cp3006 is OK: Strongswan OK - 32 ESP OK [17:51:02] RECOVERY - IPsec on cp4018 is OK: Strongswan OK - 32 ESP OK [17:51:02] RECOVERY - IPsec on cp3003 is OK: Strongswan OK - 32 ESP OK [17:51:03] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 32 ESP OK [17:51:03] RECOVERY - IPsec on cp3004 is OK: Strongswan OK - 32 ESP OK [17:51:03] ostriches: there is no other role now, it's all wrapped in the cache role (the TLS stuff) [17:51:13] RECOVERY - IPsec on cp3012 is OK: Strongswan OK - 32 ESP OK [17:51:14] text04 was fine, then. 
[17:51:36] except that puppet was totally broken [17:51:44] still is, I'm trying to sort through the nginx<->varnish port 80 hell, etc [17:52:01] Ahhh, I see it now [17:52:03] ok [17:52:37] but yeah a missing private key is likely the root of it all [17:52:50] (03PS2) 10Yuvipanda: labs: Allow projects to opt into a 'statistics' NFS mount [puppet] - 10https://gerrit.wikimedia.org/r/229265 (https://phabricator.wikimedia.org/T107576) [17:53:01] (03CR) 10Yuvipanda: [C: 032 V: 032] labs: Allow projects to opt into a 'statistics' NFS mount [puppet] - 10https://gerrit.wikimedia.org/r/229265 (https://phabricator.wikimedia.org/T107576) (owner: 10Yuvipanda) [17:53:13] (03PS2) 10Yuvipanda: labs: Enable scratch volume for k8s-eval project [puppet] - 10https://gerrit.wikimedia.org/r/229427 [17:53:20] (03CR) 10Yuvipanda: [C: 032 V: 032] labs: Enable scratch volume for k8s-eval project [puppet] - 10https://gerrit.wikimedia.org/r/229427 (owner: 10Yuvipanda) [17:54:30] 6operations, 6Services, 5Patch-For-Review, 7Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#1511632 (10Joe) [17:55:11] gah, well half my problem is a bad cherry-picked varnish commit in the labs master, for the security audit backend :P [17:55:12] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1511634 (10Dzahn) [17:55:13] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) - https://phabricator.wikimedia.org/T105756#1511633 (10Dzahn) [17:56:00] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) - https://phabricator.wikimedia.org/T105756#1450894 (10Dzahn) [17:56:02] 6operations: shutdown sodium after mailman has migrated to jessie VM - https://phabricator.wikimedia.org/T82698#1511647 (10Dzahn) [17:56:03] bblack: addressed your concerns in https://gerrit.wikimedia.org/r/#/c/158016/ if you have any time to re-review [17:56:28] YuviPanda: cool, ja 
planning on working on that in a bit [17:56:30] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) - https://phabricator.wikimedia.org/T105756#1450894 (10Dzahn) [17:56:45] ottomata: ok, I merged my patch anyway so it's ready to test when you merge yours :) [17:56:48] !log es1.7.1: restart elastic1011 [17:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:57:30] marxarelli: for now I un-cherry-picked that on deployment puppetmaster, it wasn't merging cleanly, whatever was picked there [17:58:02] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 42 ESP OK [17:58:02] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 42 ESP OK [17:58:11] bblack: ah, that may have been the old commit [17:58:13] RECOVERY - Host cp1061 is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [17:58:13] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 42 ESP OK [17:58:13] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 42 ESP OK [17:58:23] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 42 ESP OK [17:58:23] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 42 ESP OK [17:58:23] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 42 ESP OK [17:58:25] bblack: i'll check it out [17:58:52] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 42 ESP OK [17:58:53] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 42 ESP OK [17:58:53] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 42 ESP OK [17:58:53] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 42 ESP OK [17:58:57] k great, i mean, we'll need to get a VLAN ACL hole poked, YuviPanda [17:59:00] 6operations, 10ops-eqiad, 10Traffic, 5Patch-For-Review: eqiad: investigate thermal issues with some cp10xx machines - https://phabricator.wikimedia.org/T103226#1511677 (10Cmjohnson) cp1046, cp1061 and cp1063 are complete.
[17:59:03] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 42 ESP OK [17:59:20] ostriches: [17:59:21] bblack-mba:~ bblack$ curl -k -I https://en.wikipedia.beta.wmflabs.org/wiki/Main_Page [17:59:22] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 42 ESP OK [17:59:25] HTTP/1.1 200 OK [17:59:27] !! [17:59:32] ?!?!?!!!! [17:59:32] RECOVERY - IPsec on cp4014 is OK: Strongswan OK - 42 ESP OK [17:59:33] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 42 ESP OK [17:59:33] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 42 ESP OK [17:59:33] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 42 ESP OK [17:59:33] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 42 ESP OK [17:59:34] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 42 ESP OK [17:59:36] (cert is for wrong hostname or whatever, but still, it basically works) [17:59:45] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 42 ESP OK [17:59:45] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 42 ESP OK [17:59:53] just cert issues now [17:59:57] At least it works again! [18:00:03] RECOVERY - IPsec on cp4015 is OK: Strongswan OK - 42 ESP OK [18:00:04] twentyafterfour greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150805T1800). [18:00:19] (03PS1) 10GWicke: Update templates for RESTBase deploy [puppet] - 10https://gerrit.wikimedia.org/r/229429 [18:00:25] oh, -k / --insecure, got it :) [18:00:50] I thought you just worked some magic [18:00:57] before it wouldn't even start nginx or configure it right [18:01:01] it's a step in the right direction! :) [18:01:16] is it my crappy tethering or is https://test.wikidata.org/wiki/Special:Version giving a 502? [18:01:28] aude: same. [18:01:33] :( [18:01:33] ditto [18:01:49] i think hoo was going to do scap, because i18n was not up-to-date [18:01:53] what's test.wikidata.org? 
[18:01:56] probably not related [18:02:03] bblack: like test.wikipedia.org [18:02:14] and test2 [18:02:21] so many tests! [18:02:22] !log restarted HHVM on appservers (mw1136-mw1158) for tidy/pcre security updates [18:02:28] test.wikipedia also bad [18:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:02:39] probably group0 bad [18:02:49] twentyafterfour: ^ [18:02:58] test2 works [18:03:30] * aude sighs [18:04:56] aude: if I had to guess, it might be the do_esi stuff from this morning [18:05:02] looking into it in a few minutes... [18:05:18] also see a ton of Call to a member function getNamespace() on a non-object (NULL) (cirrus) exceptions [18:05:28] probably unrelated (in job queue) [18:05:50] 6operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate: Publishing translations for central notice banners fails - https://phabricator.wikimedia.org/T104774#1511687 (10atgo) [18:05:57] (03PS2) 10ArielGlenn: schedule stages of dumps to run in order on a given host [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/228809 [18:06:32] 6operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate, 10Unplanned-Sprint-Work: Publishing translations for central notice banners fails - https://phabricator.wikimedia.org/T104774#1511688 (10DStrine) [18:07:20] RECOVERY - Restbase endpoints health on restbase1009 is OK: All endpoints are healthy [18:07:32] RECOVERY - Restbase root url on restbase1009 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.023 second response time [18:07:33] !log depooled cp1059, cp1062, cp1067 ( thermal batch 2: https://phabricator.wikimedia.org/T103226 ) [18:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:07:58] bblack: thanks [18:08:01] 6operations, 10Wikimedia-Mailing-lists: install jessie on new VM for mailman - https://phabricator.wikimedia.org/T108070#1511691 (10Dzahn) 
3NEW [18:09:19] (03CR) 10Mobrovac: [C: 031] Update templates for RESTBase deploy [puppet] - 10https://gerrit.wikimedia.org/r/229429 (owner: 10GWicke) [18:11:01] 6operations, 10Wikimedia-Mailing-lists: export config and archive data from sodium - https://phabricator.wikimedia.org/T108071#1511709 (10Dzahn) 3NEW [18:11:22] 6operations, 6Analytics-Backlog, 6Release-Engineering, 7Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1511716 (10Krinkle) [18:11:25] 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1511715 (10Krinkle) [18:11:39] 6operations, 10Wikimedia-Mailing-lists: install jessie on new VM for mailman - https://phabricator.wikimedia.org/T108070#1511718 (10Dzahn) [18:12:39] I'm seeing that a part of my API queries is getting 502 errors on enwiki, commons, dewiki, (runs on Tools) [18:13:00] 6operations, 10Wikimedia-Mailing-lists: test importing of mailing list configs and archives on staging VM - https://phabricator.wikimedia.org/T108073#1511728 (10Dzahn) 3NEW [18:13:41] 6operations, 10Wikimedia-Mailing-lists: test importing of mailing list configs and archives on staging VM - https://phabricator.wikimedia.org/T108073#1511728 (10Dzahn) try if we can get way with just rsyncing everything, including mbox files and HTML archives, _without_ having to regenerate HTML archives to av... [18:13:42] so maybe this isn't testwiki-specific? [18:14:11] (03CR) 10GWicke: "Note: Please do not merge until puppet is disabled on production nodes, and the RESTBase deploy is ready." [puppet] - 10https://gerrit.wikimedia.org/r/229429 (owner: 10GWicke) [18:14:18] sitic: can you give an example URL I can hit? 
[18:14:40] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1059_v4, cp1059_v6 [18:15:01] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1062_v4, cp1062_v6 [18:15:01] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1062_v4, cp1062_v6 [18:15:11] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1062_v4, cp1062_v6 [18:15:11] PROBLEM - IPsec on cp3005 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1067_v4, cp1067_v6 [18:15:11] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1062_v4, cp1062_v6 [18:15:11] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1067_v4, cp1067_v6 [18:15:30] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1062_v4, cp1062_v6 [18:15:30] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1067_v4, cp1067_v6 [18:15:30] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1067_v4, cp1067_v6 [18:15:31] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1067_v4, cp1067_v6 [18:15:31] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1062_v4, cp1062_v6 [18:15:31] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1062_v4, cp1062_v6 [18:15:31] PROBLEM - IPsec on cp3018 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1059_v4, cp1059_v6 [18:15:32] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1062_v4, cp1062_v6 [18:15:32] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1062_v4, cp1062_v6 [18:15:33] PROBLEM - IPsec on cp3006 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1067_v4, cp1067_v6 [18:15:34] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1062_v4, cp1062_v6 [18:15:48] it is 
probably not wikidata sepcific, as https://test.wikipedia.org/wiki/Main_Page gives 502 [18:15:50] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1062_v4, cp1062_v6 [18:15:51] PROBLEM - IPsec on cp4016 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1067_v4, cp1067_v6 [18:15:51] PROBLEM - IPsec on cp3009 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1067_v4, cp1067_v6 [18:15:51] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1062_v4, cp1062_v6 [18:15:51] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1067_v4, cp1067_v6 [18:15:53] bblack: https://test.wikipedia.org/wiki/Kitten ir [18:15:54] or [18:16:00] PROBLEM - IPsec on cp4009 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1067_v4, cp1067_v6 [18:16:01] PROBLEM - IPsec on cp3016 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1059_v4, cp1059_v6 [18:16:01] PROBLEM - IPsec on cp3013 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1067_v4, cp1067_v6 [18:16:01] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1062_v4, cp1062_v6 [18:16:01] PROBLEM - IPsec on cp3014 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1067_v4, cp1067_v6 [18:16:01] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1062_v4, cp1062_v6 [18:16:01] PROBLEM - IPsec on cp4017 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1067_v4, cp1067_v6 [18:16:02] PROBLEM - IPsec on cp3015 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1059_v4, cp1059_v6 [18:16:04] https://test.wikidata.org/wiki/Special:Version is what i was looking at [18:16:11] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1059_v4, cp1059_v6 [18:16:18] yeah but it may not be test-specific either, sitic said "I'm seeing that a part of my API queries is getting 502 errors on enwiki, commons, dewiki, (runs on Tools)" [18:16:20] PROBLEM - IPsec on cp4007 
is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1062_v4, cp1062_v6 [18:16:21] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1067_v4, cp1067_v6 [18:16:21] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1067_v4, cp1067_v6 [18:16:21] PROBLEM - IPsec on cp3017 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1059_v4, cp1059_v6 [18:16:21] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1062_v4, cp1062_v6 [18:16:29] i wouldn't rule out wikibase something [18:16:31] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1059_v4, cp1059_v6 [18:16:31] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1062_v4, cp1062_v6 [18:16:38] but don't know [18:16:39] bblack: I'm quering the watchlist with timestamp as parameter, so it's always different (don't have a complete url) [18:16:41] PROBLEM - IPsec on cp3031 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1067_v4, cp1067_v6 [18:16:41] PROBLEM - IPsec on cp3004 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1067_v4, cp1067_v6 [18:16:41] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1062_v4, cp1062_v6 [18:16:41] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1062_v4, cp1062_v6 [18:16:41] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1059_v4, cp1059_v6 [18:16:50] PROBLEM - IPsec on cp3012 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1067_v4, cp1067_v6 [18:16:50] PROBLEM - IPsec on cp4010 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1067_v4, cp1067_v6 [18:16:50] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1062_v4, cp1062_v6 [18:16:50] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1062_v4, cp1062_v6 [18:16:50] PROBLEM - IPsec on cp4008 is CRITICAL: Strongswan CRITICAL - 
ok: 30 not-conn: cp1067_v4, cp1067_v6 [18:16:50] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1062_v4, cp1062_v6 [18:16:57] sitic: can you just give me a random example? [18:17:07] second [18:17:07] 6operations, 10Wikimedia-Mailing-lists: export config and archive data from sodium - https://phabricator.wikimedia.org/T108071#1511743 (10Dzahn) [18:17:11] PROBLEM - IPsec on cp3003 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp1067_v4, cp1067_v6 [18:17:48] i'm not seeing the problem on en.wp for random pages and special pages [18:19:05] ah got it with curl to varnish directly: [18:19:07] root@cp1065:~# curl -vI -H 'X-Forwarded-Proto: https' -H 'Host: test.wikipedia.org' http://localhost:80/wiki/Main_Page [18:19:10] * Hostname was NOT found in DNS cache [18:19:13] * Trying ::1... [18:19:15] * Connected to localhost (::1) port 80 (#0) [18:19:18] > HEAD /wiki/Main_Page HTTP/1.1 [18:19:20] bblack: I see 502 result in the logs for https://en.wikipedia.org/w/api.php?action=query&meta=userinfo&uiprop=rights but in curl everything works [18:19:20] > User-Agent: curl/7.38.0 [18:19:23] > Accept: */* [18:19:25] > X-Forwarded-Proto: https [18:19:28] > Host: test.wikipedia.org [18:19:30] > [18:19:33] * Empty reply from server [18:19:35] * Connection #0 to host localhost left intact [18:19:38] curl: (52) Empty reply from server [18:19:40] that looks like a do_esi issue [18:19:41] as i said, also no problem on test2.wikipedia.org (which is group 0 also) [18:19:46] let me revert do_esi, see what that does for test, then look back at that [18:19:58] the do_esi change was specific to hostnames matching '^test\.' 
[18:20:05] bblack: i see [18:20:28] (03PS1) 10BBlack: Revert "bugfix for 4bc472f9: beresp.do_esi must be set in fetch, not recv" [puppet] - 10https://gerrit.wikimedia.org/r/229432 [18:20:32] (03PS2) 10BBlack: Revert "bugfix for 4bc472f9: beresp.do_esi must be set in fetch, not recv" [puppet] - 10https://gerrit.wikimedia.org/r/229432 [18:20:37] (03CR) 10BBlack: [C: 032 V: 032] Revert "bugfix for 4bc472f9: beresp.do_esi must be set in fetch, not recv" [puppet] - 10https://gerrit.wikimedia.org/r/229432 (owner: 10BBlack) [18:20:46] (03PS1) 10BBlack: Revert "Enable ESI for testwiki" [puppet] - 10https://gerrit.wikimedia.org/r/229433 [18:20:53] (03PS2) 10BBlack: Revert "Enable ESI for testwiki" [puppet] - 10https://gerrit.wikimedia.org/r/229433 [18:21:00] (03CR) 10BBlack: [C: 032 V: 032] Revert "Enable ESI for testwiki" [puppet] - 10https://gerrit.wikimedia.org/r/229433 (owner: 10BBlack) [18:21:34] (03PS1) 10ArielGlenn: dumps: call to findandlocknextwiki was missing an argument [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/229434 [18:22:42] salting puppet for the reversion now... [18:23:30] 502 on test.wikidata is gone [18:23:59] so it appears that you fixed it: https://test.wikidata.org/wiki/Special:Version [18:24:22] can I go ahead with the train deploy then? [18:24:35] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1511765 (10Dzahn) [18:24:36] 6operations, 10Wikimedia-Mailing-lists: install jessie on new VM for mailman - https://phabricator.wikimedia.org/T108070#1511764 (10Dzahn) [18:24:46] yeah that was it [18:24:52] bblack: looks better :) [18:24:54] twentyafterfour: I think so, from my end of things [18:25:06] sitic: still getting prod host 502 errors? 
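The reverted commit's message ("beresp.do_esi must be set in fetch, not recv") refers to a Varnish 3 rule: do_esi is a property of the backend response object, which only exists in vcl_fetch. A minimal sketch of the shape such a change takes — not the actual production VCL — using the '^test\.' host match mentioned above:

```vcl
sub vcl_fetch {
    # Sketch only: beresp is in scope here, so vcl_fetch is the legal place
    # to set do_esi; referencing beresp in vcl_recv fails, hence the bugfix.
    if (req.http.Host ~ "^test\.") {
        set beresp.do_esi = true;
    }
}
```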
[18:25:09] * aude checks i18n stuff [18:25:31] twentyafterfour: looks ok to me [18:26:12] bblack: no, seems fine now [18:27:22] 6operations, 10RESTBase-Cassandra: upgrade RESTBase cluster to Cassandra 2.1.8 - https://phabricator.wikimedia.org/T107949#1511786 (10GWicke) +1 from me as well. [18:27:29] (03CR) 10BBlack: "Note on reversion: requests to test.wikipedia.org (and similar) were failing with 502 errors from nginx. Testing directly against varnish" [puppet] - 10https://gerrit.wikimedia.org/r/225243 (owner: 10Ori.livneh) [18:28:18] sitic: I'm guessing that's somehow related to the testwiki do_esi then, but I can't fathom the connection other than "anytime ESI is on in varnish, shit breaks" :P [18:28:49] sitic: any chance your requests were using "X-Wikimedia-Debug: 1" header? [18:28:53] 6operations: generate command lists for dump scheduler - https://phabricator.wikimedia.org/T107860#1511788 (10ArielGlenn) 5Open>3Resolved done in https://gerrit.wikimedia.org/r/#/c/229134/ and a follupw correction to some numbers. [18:28:54] 6operations: Make dumps run via cron on each snapshot host - https://phabricator.wikimedia.org/T107750#1511790 (10ArielGlenn) [18:29:55] hmmm even then I don't think that sets the request hostname to test [18:30:01] bblack: unfortunately I didn't have the headers, I was only getting error emails about the 502s and could see them in a browser or something else [18:30:07] ok [18:30:22] 6operations, 10vm-requests, 7Pybal: codfw: 3 VM %request for PyBal - https://phabricator.wikimedia.org/T107901#1511796 (10ori) [18:30:29] bblack: just got in, reading the backlog. 
[18:31:10] ori: so I tried it, but it caused varnish to fail completely (closed connection with no response) for testwiki requests with do_esi on [18:31:11] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1511799 (10Tgr) It sounds like you already had that commit locally. How did you install MediaWiki, tarballs or a git checkout? [18:31:53] ori: and then also, for some reason that caused similar breakage for sitic's API queries of the form: "https://en.wikipedia.org/w/api.php?action=query&meta=userinfo&uiprop=rights", which went away on ESI revert as well [18:31:56] 6operations, 10vm-requests: (do not) request VM for grafana - https://phabricator.wikimedia.org/T107832#1511800 (10Dzahn) [18:32:11] but in general, it mostly didn't impact prod, no other alarms went off anyways. not sure what the link is there [18:33:15] (03PS1) 1020after4: group1 wikis to 1.26wmf17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229439 [18:33:30] (03CR) 1020after4: [C: 032] group1 wikis to 1.26wmf17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229439 (owner: 1020after4) [18:33:35] (03Merged) 10jenkins-bot: group1 wikis to 1.26wmf17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229439 (owner: 1020after4) [18:33:49] 6operations: copy partial dumps from dataset host to labs - https://phabricator.wikimedia.org/T108077#1511816 (10ArielGlenn) 3NEW a:3ArielGlenn [18:33:57] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group1 wikis to 1.26wmf17 [18:33:58] in the logs I see the 502 for all sorts of API queries, but only a part of them failed. 
The same query worked seconds later [18:34:09] 6operations: copy partial dumps from dataset host to labs - https://phabricator.wikimedia.org/T108077#1511828 (10ArielGlenn) [18:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:34:10] 6operations, 7Tracking: staged dumps tracking task - https://phabricator.wikimedia.org/T107757#1511827 (10ArielGlenn) [18:35:03] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1511829 (10Dzahn) a:3RobH [18:35:39] 6operations: copy partial dumps from dataset host to labs - https://phabricator.wikimedia.org/T108077#1511816 (10ArielGlenn) Coren, I've added you on this so we can chat about space available in labs for the dumps copy. [18:36:07] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1511472 (10Dzahn) per IRC talk, thank you for trying to create this so akosiaris doesn't have to respond to all VM requests [18:36:13] 6operations, 6Labs: audit labs versus production ssh keys - https://phabricator.wikimedia.org/T108078#1511839 (10RobH) 3NEW a:3RobH [18:36:20] 6operations: copy partial dumps from dataset host to labs - https://phabricator.wikimedia.org/T108077#1511847 (10coren) Do you already have a ballpark of how much space you'd need? [18:36:38] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1511849 (10RobH) This is typically approved by Alex, as it is his process. However, this was already approved by @Mark. As such, we are still filing all the tickets for full audit. (Much like folks someti... 
[18:37:12] !log es1.7.1: restart elastic1012 [18:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:37:47] 6operations, 10Wikimedia-Mailing-lists: service IP can't be switched over - https://phabricator.wikimedia.org/T108080#1511861 (10Dzahn) 3NEW [18:38:10] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1511869 (10Dzahn) [18:39:34] (03CR) 10ArielGlenn: [C: 032] dumps: call to findandlocknextwiki was missing an argument [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/229434 (owner: 10ArielGlenn) [18:40:03] hey akosiaris, yt? [18:42:27] 6operations, 10Wikimedia-Mailing-lists: service IP can't be switched over - https://phabricator.wikimedia.org/T108080#1511881 (10JohnLewis) This is legacy from lily which @mark handled in the last migration. Looks like we could assign a new IP for lists. It's used as a secondary on the server for lists web and... [18:42:35] (03PS3) 10BBlack: VCL: use network::constants::all_networks_lo [puppet] - 10https://gerrit.wikimedia.org/r/229122 [18:42:37] (03PS4) 10BBlack: network::constants::all_networks(_lo)? 
via flatten() [puppet] - 10https://gerrit.wikimedia.org/r/228586 [18:42:38] 10Ops-Access-Requests, 6operations, 10Wikimedia-Mailing-lists: give John Lewis shell access on the staging VM - https://phabricator.wikimedia.org/T108082#1511883 (10Dzahn) 3NEW [18:42:39] (03PS3) 10BBlack: restrict_access: move to common code for all backends [puppet] - 10https://gerrit.wikimedia.org/r/229121 [18:42:41] (03PS3) 10BBlack: VCL: remove fqdn comment line [puppet] - 10https://gerrit.wikimedia.org/r/228584 [18:42:43] (03PS4) 10BBlack: varnish: get rid of some pre-systemd cruft [puppet] - 10https://gerrit.wikimedia.org/r/228591 [18:42:45] (03PS3) 10BBlack: vhtcpd: /etc/init/varnishhtcpd.conf is long-gone now [puppet] - 10https://gerrit.wikimedia.org/r/228590 [18:42:47] (03PS3) 10BBlack: VCL: define vcl_config "layer" for parsoidcache [puppet] - 10https://gerrit.wikimedia.org/r/228589 [18:42:49] (03PS3) 10BBlack: VCL: remove unused probes "swift", "options" [puppet] - 10https://gerrit.wikimedia.org/r/228588 [18:42:49] 10Ops-Access-Requests, 6operations, 10Wikimedia-Mailing-lists: give John Lewis shell access on the staging VM - https://phabricator.wikimedia.org/T108082#1511883 (10Dzahn) acked by Faidon/Mark in meeting today [18:42:52] 6operations: copy partial dumps from dataset host to labs - https://phabricator.wikimedia.org/T108077#1511890 (10ArielGlenn) About 3x times one full run to be on the safe side. One run these days takes (guesstimate) 2.5T so we're looking at 8T to be safe. I forget what the last round of negotiations landed us... 
[18:43:12] 6operations, 10Wikimedia-Mailing-lists: install jessie on new VM for mailman - https://phabricator.wikimedia.org/T108070#1511892 (10Dzahn) [18:43:14] (03CR) 10BBlack: [C: 032 V: 032] vhtcpd: /etc/init/varnishhtcpd.conf is long-gone now [puppet] - 10https://gerrit.wikimedia.org/r/228590 (owner: 10BBlack) [18:43:14] 10Ops-Access-Requests, 6operations, 10Wikimedia-Mailing-lists: give John Lewis shell access on the staging VM - https://phabricator.wikimedia.org/T108082#1511891 (10Dzahn) [18:43:25] (03CR) 10BBlack: [C: 032 V: 032] VCL: remove fqdn comment line [puppet] - 10https://gerrit.wikimedia.org/r/228584 (owner: 10BBlack) [18:43:36] (03CR) 10BBlack: [C: 032 V: 032] VCL: remove unused probes "swift", "options" [puppet] - 10https://gerrit.wikimedia.org/r/228588 (owner: 10BBlack) [18:43:42] 10Ops-Access-Requests, 6operations, 10Wikimedia-Mailing-lists: give John Lewis shell access on the mailman staging VM - https://phabricator.wikimedia.org/T108082#1511893 (10Krenair) [18:45:25] (03CR) 10BBlack: [C: 032 V: 032] VCL: define vcl_config "layer" for parsoidcache [puppet] - 10https://gerrit.wikimedia.org/r/228589 (owner: 10BBlack) [18:45:46] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1511895 (10RobH) Actually, there are no existing ganeti VMs using public IP space. I'm not sure of the support of that. [18:46:07] halfak: krinkle mind if I setup a meeting on.... friday? Just a chat about cvn and how ORES fits into it [18:46:18] 10Ops-Access-Requests, 6operations, 10Wikimedia-Mailing-lists: give John Lewis shell access on the mailman staging VM - https://phabricator.wikimedia.org/T108082#1511897 (10JohnLewis) T102075 is previous ticket which was discussed in an ops meeting. Currently unable to provide a SSH key due to ongoing ISP i... [18:46:27] +1 [18:46:45] halfak: can you actually do the setup? 
:D New laptop, still don't have calendering setup [18:46:54] krinkle there's also https://gerrit.wikimedia.org/r/#/c/229423/ [18:47:00] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1511900 (10Dzahn) [18:47:16] wtf is ns 2600? [18:47:23] (03CR) 10Ottomata: "I had asked Faidon generally about the idea of including dependencies that aren't available across all targeted distros, and he said he th" (035 comments) [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/229193 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [18:47:35] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1511472 (10Dzahn) i adjusted the request to "private IP". we probably don't need a public one just yet, at least not for testing the config and archive import [18:47:36] YuviPanda: Sounds good. [18:48:01] Mjbmr, "Topic" [18:48:04] It's used by Flow [18:48:11] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 42 ESP OK [18:48:11] RECOVERY - IPsec on cp4015 is OK: Strongswan OK - 42 ESP OK [18:48:11] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 42 ESP OK [18:48:11] RECOVERY - IPsec on cp3006 is OK: Strongswan OK - 32 ESP OK [18:48:21] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 42 ESP OK [18:48:25] halfak: no talk ns? 
[18:48:31] RECOVERY - IPsec on cp4016 is OK: Strongswan OK - 32 ESP OK [18:48:31] RECOVERY - IPsec on cp3009 is OK: Strongswan OK - 32 ESP OK [18:48:31] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 42 ESP OK [18:48:31] RECOVERY - IPsec on cp4018 is OK: Strongswan OK - 32 ESP OK [18:48:32] RECOVERY - IPsec on cp4009 is OK: Strongswan OK - 32 ESP OK [18:48:40] RECOVERY - IPsec on cp3013 is OK: Strongswan OK - 32 ESP OK [18:48:40] RECOVERY - IPsec on cp3014 is OK: Strongswan OK - 32 ESP OK [18:48:40] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 42 ESP OK [18:48:40] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 42 ESP OK [18:48:41] RECOVERY - IPsec on cp4017 is OK: Strongswan OK - 32 ESP OK [18:48:48] Mjbmr, well, I think that would be ns 1 ;) [18:49:00] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 42 ESP OK [18:49:00] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 32 ESP OK [18:49:00] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 32 ESP OK [18:49:00] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 42 ESP OK [18:49:12] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 42 ESP OK [18:49:12] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 42 ESP OK [18:49:12] RECOVERY - IPsec on cp3004 is OK: Strongswan OK - 32 ESP OK [18:49:12] RECOVERY - IPsec on cp3031 is OK: Strongswan OK - 32 ESP OK [18:49:20] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 42 ESP OK [18:49:22] RECOVERY - IPsec on cp4010 is OK: Strongswan OK - 32 ESP OK [18:49:22] RECOVERY - IPsec on cp3012 is OK: Strongswan OK - 32 ESP OK [18:49:22] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 42 ESP OK [18:49:22] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 42 ESP OK [18:49:23] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 16 ESP OK [18:49:23] RECOVERY - IPsec on cp4008 is OK: Strongswan OK - 32 ESP OK [18:49:23] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 42 ESP OK [18:49:27] ^ Lol [18:49:29] This channel [18:49:41] RECOVERY - IPsec on cp4005 is OK: Strongswan OK 
- 42 ESP OK [18:49:41] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 42 ESP OK [18:49:44] halfak: do you think the name of flow must be localized in other languages as well, or equal to "flow"? [18:49:50] RECOVERY - IPsec on cp3003 is OK: Strongswan OK - 32 ESP OK [18:49:51] RECOVERY - IPsec on cp4014 is OK: Strongswan OK - 42 ESP OK [18:50:00] RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 32 ESP OK [18:50:00] RECOVERY - IPsec on cp3005 is OK: Strongswan OK - 32 ESP OK [18:50:00] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 42 ESP OK [18:50:08] Mjbmr, not sure. I bet that have some answers in #wikimedia-collaboration though. [18:50:10] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 42 ESP OK [18:50:11] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 32 ESP OK [18:50:11] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 32 ESP OK [18:50:11] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 32 ESP OK [18:50:11] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 42 ESP OK [18:50:11] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 42 ESP OK [18:50:11] RECOVERY - IPsec on cp3018 is OK: Strongswan OK - 16 ESP OK [18:50:33] halfak: they translated flow in Persian equal to stream, and ns 2600 equal to subject. [18:50:37] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1511925 (10RobH) The next available name in elements is fermium. I'd like to create these, but I don't want to overstep into @akosiaris' process. Perhaps we can get him to sign off on this and kick it bac... [18:50:40] RECOVERY - IPsec on cp3016 is OK: Strongswan OK - 16 ESP OK [18:50:45] Mjbmr, makes sense to me. 
[18:50:50] RECOVERY - IPsec on cp3015 is OK: Strongswan OK - 16 ESP OK [18:50:51] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - 16 ESP OK [18:50:55] halfak: meh [18:51:01] RECOVERY - IPsec on cp3017 is OK: Strongswan OK - 16 ESP OK [18:51:08] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1511934 (10JohnLewis) Private is fine for now as we're just testing the process. Perhaps add to misc do we can view and physically interact but that's not a strict requirement per se. [18:51:11] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - 16 ESP OK [18:51:20] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - 16 ESP OK [18:52:35] halfak: sorry, I don't have a good answer for disabling ipsec checks for host reboots. any one host causes that alert on like 20 others :/ [18:52:45] halfak: to you does "flow bots" means flow bots? [18:54:25] bblack, no worries. So long as you guys don't mind it. :) [18:54:30] !log depooled cp1060, cp1064 ( thermal batch 3: https://phabricator.wikimedia.org/T103226 ) [18:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:54:43] Mjbmr, to me, "Flow Bots" reminds me of a hiphop band https://en.wikipedia.org/wiki/Flobots [18:54:44] ^ that will cause the final batch of expect ipsec fail -> recover for today heh [18:55:03] halfak: thank you [18:55:08] :D [18:56:02] (03CR) 10HaeB: [C: 031] "Hi Daniel, actually the EventLogging instrumentation for the blog's pageviews was ported to the third-party hosting and is still generatin" [puppet] - 10https://gerrit.wikimedia.org/r/229390 (owner: 10Dzahn) [18:56:43] ok, YuviPanda, merging my thing, will run puppet on stat boxes (to make sure my hiera change is cool), and then on labstore1003 [18:56:51] (03PS4) 10Ottomata: labs: Setup /srv/statistics for rsync from stats hosts [puppet] - 10https://gerrit.wikimedia.org/r/229262 (https://phabricator.wikimedia.org/T107576) [18:56:51] ottomata: ok! 
then we can test on some project [18:57:00] well, we will need a hole pocked! [18:57:02] poked [18:57:04] to rsync anything [18:57:07] from analytics VLAN [18:57:18] but ja at least we can test the mount [18:58:02] ottomata: oh ya true [18:59:05] uhhhh, yesterday I could log into labstore1003, now I get Permission denied (publickey).? [18:59:26] ottomata: root@ [18:59:30] (03CR) 10Ottomata: [C: 032] labs: Setup /srv/statistics for rsync from stats hosts [puppet] - 10https://gerrit.wikimedia.org/r/229262 (https://phabricator.wikimedia.org/T107576) (owner: 10Ottomata) [18:59:55] (03PS17) 10Dduvall: beta: varnish backend/director for isolated security audits [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://phabricator.wikimedia.org/T72181) [18:59:58] (03CR) 10JanZerebecki: [C: 031] Add php5-curl package to Phragile. [puppet] - 10https://gerrit.wikimedia.org/r/229355 (https://phabricator.wikimedia.org/T101235) (owner: 10Jakob) [19:00:00] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1060_v4, cp1060_v6 [19:00:11] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6 [19:00:11] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6 [19:00:11] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1060_v6 not-conn: cp1060_v4 [19:00:11] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6 [19:00:14] naw [19:00:31] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6 [19:00:31] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6 [19:00:41] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6 [19:00:51] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6 [19:00:57] (03CR) 10Dduvall: [C: 
031] "Cherry-picked the latest PS on deployment-puppetmaster. It applies/provisions cleanly and works as expected." [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://phabricator.wikimedia.org/T72181) (owner: 10Dduvall) [19:01:01] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6 [19:01:02] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6 [19:01:02] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6 [19:01:02] PROBLEM - IPsec on cp3018 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1060_v4, cp1060_v6 [19:01:10] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6 [19:01:11] YuviPanda: I do have an account there, rigth? can you see auth.log on labstore1003? y it no like me? [19:01:11] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6 [19:01:11] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6 [19:01:21] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6 [19:01:23] ottomata: hmm, checkin [19:01:25] or, i don't really care, will you run puppet ther efor me? [19:01:30] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6 [19:01:31] PROBLEM - IPsec on cp3016 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1060_v4, cp1060_v6 [19:01:40] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6 [19:01:40] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6 [19:01:41] PROBLEM - IPsec on cp3015 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1060_v4, cp1060_v6 [19:01:44] ottomata: hmm, I can't ssh in either?! 
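The IPsec alert fan-out bblack describes above (any one depooled or rebooted cache host trips the Strongswan check on every peer it tunnels to) can be untangled by grouping the alerts by the unreachable peer rather than by the alerting host. A rough sketch, using sample lines copied from this log:

```python
import re
from collections import defaultdict

# Sample icinga alert lines taken from this log, truncated to the check output.
ALERTS = [
    "PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6",
    "PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6",
    "PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1060_v6 not-conn: cp1060_v4",
    "PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1060_v4, cp1060_v6",
]

LINE_RE = re.compile(r"PROBLEM - IPsec on (\S+) is CRITICAL:.*?not-conn: (.+)$")

def group_by_peer(lines):
    """Map each unreachable peer host to the set of hosts alerting about it."""
    peers = defaultdict(set)
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        alerting_host, not_conn = m.groups()
        for entry in not_conn.split(","):
            # Strip the _v4/_v6 suffix to recover the peer hostname.
            peers[entry.strip().rsplit("_", 1)[0]].add(alerting_host)
    return dict(peers)

print(group_by_peer(ALERTS))
```

On these four sample lines the dozens of alerts collapse to exactly two real events, the depooled hosts cp1060 and cp1064.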
[19:01:51] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1060_v4, cp1060_v6 [19:01:51] coren can you ssh into labstore1003? [19:01:52] * YuviPanda can't [19:02:00] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6 [19:02:00] PROBLEM - IPsec on cp3017 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1060_v4, cp1060_v6 [19:02:01] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6 [19:02:10] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1060_v4, cp1060_v6 [19:02:11] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6 [19:02:20] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6 [19:02:20] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp1064_v4, cp1064_v6 [19:03:36] YuviPanda: Hm. Only w/ root, yet I see the opsen accounts on it. [19:03:44] YuviPanda: Lemme check the logs [19:03:54] coren hmm, I can't ssh in with root either [19:04:19] oh that's my .ssh/config fucking u [19:04:54] Ah, and I can log in as myself as well - just not to labstor1003. :-) [19:05:16] (Proxycommand does have the disadvantage that most actual error messages are ridiculously obscured) [19:07:38] proxycommand should obscure errors to bastion, but not to the host itself (the -v -v is given to the command that connects to the host, not to the command that connects to bastion) [19:08:41] coren nvm, was a ssh config issue :) [19:08:42] so ssh -v -v labstore1003.eqiad.etc *should* give you something useful [19:08:45] ottomata: puppet has run [19:09:11] PROBLEM - puppet last run on labstore1003 is CRITICAL Puppet has 1 failures [19:09:52] ottomata: oof your change applied admin to that node.... 
[19:09:56] - cluster: labsnfs [19:09:56] + cluster: misc [19:09:59] coren ^ [19:10:02] ugh this might be bad [19:10:18] and removed [19:10:22] -hosts allow = 208.80.154.11 208.80.152.185 [19:10:22] + [19:10:28] ? [19:10:32] my cahnge did? [19:10:35] yes [19:10:44] looking.... [19:10:53] YuviPanda: Thankfully, changes to 1003 are unlikely to break anything but 1003. [19:11:07] OH, because it was a role hiera thing maybe??? [19:11:27] ottomata: and you didn't fix the $dump_server_ips thing I pointed out [19:11:41] :P [19:11:41] hm i see one mistake [19:11:46] i added the param but didn't fix the hiera! [19:11:49] fixing that now [19:11:54] i don't see the cluster change though [19:12:07] ottomata: how so? I don't see the param [19:12:18] (03CR) 10JanZerebecki: [C: 031] wikidata query: add misc-web configuration [puppet] - 10https://gerrit.wikimedia.org/r/229392 (https://phabricator.wikimedia.org/T107602) (owner: 10Giuseppe Lavagetto) [19:13:38] !!! i didn't push it, oh man> [19:14:11] PROBLEM - puppet last run on cp3020 is CRITICAL puppet fail [19:14:58] (03PS1) 10Andrew Bogott: Update archive-project-volumes to support our new NFS setup. 
[puppet] - 10https://gerrit.wikimedia.org/r/229458 [19:15:05] seems like cp3020 is just random master fail [19:15:08] ottomata: :) [19:15:21] (03PS1) 10Ottomata: Fix for $dumps_servers_ips for role::labs::nfs::extras [puppet] - 10https://gerrit.wikimedia.org/r/229459 [19:15:22] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 42 ESP OK [19:15:22] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 42 ESP OK [19:15:25] (03PS1) 10Sbisson: Disable Special:NewMessages on wiki with LiquidThreads frozen [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229460 (https://phabricator.wikimedia.org/T107898) [19:15:32] RECOVERY - IPsec on cp4014 is OK: Strongswan OK - 42 ESP OK [19:15:40] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 42 ESP OK [19:15:51] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 42 ESP OK [19:15:51] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 42 ESP OK [19:15:51] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 42 ESP OK [19:15:51] (03PS2) 10Andrew Bogott: Update archive-project-volumes to support our new NFS setup. [puppet] - 10https://gerrit.wikimedia.org/r/229458 (https://phabricator.wikimedia.org/T104857) [19:15:51] RECOVERY - IPsec on cp3018 is OK: Strongswan OK - 16 ESP OK [19:15:52] YuviPanda: https://gerrit.wikimedia.org/r/#/c/229459/1 better? 
[19:15:59] still not sure how the cluster got changed thogh [19:16:01] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 42 ESP OK [19:16:01] RECOVERY - IPsec on cp4015 is OK: Strongswan OK - 42 ESP OK [19:16:01] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 42 ESP OK [19:16:11] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 42 ESP OK [19:16:11] RECOVERY - puppet last run on cp3020 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [19:16:20] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 42 ESP OK [19:16:21] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 42 ESP OK [19:16:21] RECOVERY - IPsec on cp3016 is OK: Strongswan OK - 16 ESP OK [19:16:21] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 42 ESP OK [19:16:31] RECOVERY - IPsec on cp3015 is OK: Strongswan OK - 16 ESP OK [19:16:36] AH [19:16:37] ottomata: yeah, trying to find out also how the admin module got included. I guess that's just the cluste [19:16:38] ahahah [19:16:40] it was in the hiera file [19:16:41] got it [19:16:41] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - 16 ESP OK [19:16:41] yeah [19:16:42] role stuff [19:16:44] that hsould fix it then [19:16:51] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 42 ESP OK [19:16:51] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 42 ESP OK [19:16:51] RECOVERY - IPsec on cp3017 is OK: Strongswan OK - 16 ESP OK [19:17:01] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - 16 ESP OK [19:17:02] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - 16 ESP OK [19:17:02] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 42 ESP OK [19:17:02] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 42 ESP OK [19:17:10] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 42 ESP OK [19:17:12] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 42 ESP OK [19:17:12] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 42 ESP OK [19:17:12] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 16 ESP OK [19:17:12] RECOVERY - IPsec on cp3035 is OK: 
Strongswan OK - 42 ESP OK [19:17:19] yeah, YuviPanda cluster: labsnfs was in the dumps.yaml hiera role file [19:17:28] so when I changed the classname, it wasn't applied [19:17:38] (03PS3) 10Andrew Bogott: Update archive-project-volumes to support our new NFS setup. [puppet] - 10https://gerrit.wikimedia.org/r/229458 (https://phabricator.wikimedia.org/T104857) [19:17:51] (03CR) 10Yuvipanda: [C: 04-1] "Would also be great if we can make it pep8 compatible, not a big deal though :) I'm hoping we can clean these up as we go along..." [puppet] - 10https://gerrit.wikimedia.org/r/229458 (https://phabricator.wikimedia.org/T104857) (owner: 10Andrew Bogott) [19:17:58] ottomata: that makes sense [19:18:01] (03CR) 10Ottomata: [C: 032] Fix for $dumps_servers_ips for role::labs::nfs::extras [puppet] - 10https://gerrit.wikimedia.org/r/229459 (owner: 10Ottomata) [19:18:31] YuviPanda: running puppet on labstore1003 (i can log in now!) [19:18:59] ottomata: ya ok [19:19:10] oh ha, i could log in because I applied admin module? :p [19:19:15] hahahah [19:19:15] yes [19:19:18] (03PS4) 10Andrew Bogott: Update archive-project-volumes to support our new NFS setup. [puppet] - 10https://gerrit.wikimedia.org/r/229458 (https://phabricator.wikimedia.org/T104857) [19:19:18] ottomata: indeed, you shouldn't have been able ot [19:19:20] *to [19:19:23] now i can't log in anymore [19:19:23] hah [19:19:44] andrewbogott: I left some comments, I'll do another pass when this is done (labstore1003 issues) [19:19:46] !log all caches depooled for thermal stuff repooled [19:19:46] ok, but that looks better. 
:) [19:19:50] RECOVERY - puppet last run on labstore1003 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [19:19:52] so YuviPanda I think you can test the nfs mount [19:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:20:11] (03PS1) 10EBernhardson: Log CirrusSearchUserTesting monolog channel to fluorine [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229462 [19:20:39] YuviPanda: it is pep8 isn’t it? Via my test it is [19:20:45] andrewbogott: oh, ok. [19:20:57] hence all the whitespace changes in my patch :( [19:21:06] chasemp do you know what happens to a host if admin is applied then unapplied? [19:21:59] 6operations, 10ops-eqiad, 10Traffic, 5Patch-For-Review: eqiad: investigate thermal issues with some cp10xx machines - https://phabricator.wikimedia.org/T103226#1512027 (10BBlack) @cmjohnson did the thermal paste work on the other 8 hosts. So far everything looks peachy on the core temp values: ``` root@p... [19:22:01] (03PS5) 10Andrew Bogott: Update archive-project-volumes to support our new NFS setup. [puppet] - 10https://gerrit.wikimedia.org/r/229458 (https://phabricator.wikimedia.org/T104857) [19:22:11] ottomata: hmm, wondering which project to test this on [19:22:18] I'll test it on the analytics project I guess [19:22:21] hmm [19:22:22] no [19:22:25] I want that project to die :P [19:22:29] WHAT?! [19:22:29] I'll test it on something completely unrelated [19:22:30] why? [19:22:36] I imagine nothing really as no admin will mean no account mgmt at all. 
Status quo will persist [19:23:09] it does remove accounts [19:23:20] or, at least [19:23:20] ssh keys [19:23:27] Notice: /Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/otto]/ensure: removed [19:23:34] but i guess that is just ssh doing its thang [19:23:35] ah ok [19:23:40] homedirs all exist [19:23:41] so as long as root is fine that's great [19:23:46] Ah sshkeys is probably a purge from puppet [19:23:50] and /etc/passwd entries [19:23:50] (03PS1) 10Yuvipanda: labs: Enable statistics mount on a random host to test [puppet] - 10https://gerrit.wikimedia.org/r/229466 [19:24:01] (03PS2) 10Yuvipanda: labs: Enable statistics mount on a random host to test [puppet] - 10https://gerrit.wikimedia.org/r/229466 [19:24:10] (03CR) 10Yuvipanda: [C: 032 V: 032] labs: Enable statistics mount on a random host to test [puppet] - 10https://gerrit.wikimedia.org/r/229466 (owner: 10Yuvipanda) [19:24:16] Not technically from admin :) but good point [19:24:17] we should probably cleanup the homedirs [19:24:28] YuviPanda: I can delete them [19:24:31] ottomata: are you on palladium? do a puppet merge? [19:24:35] i have the list of them in front of me from puppet [19:24:36] sure [19:24:43] thanks [19:24:46] YuviPanda: merged. [19:24:48] I'm testing the mount [19:24:57] haha, and i'm still logged into labstore so maybe I can delete [19:25:19] ottomata: if your key isn't in the root key list it should be :D [19:25:37] ottomata: YES THE MOUNT WORKS WOO [19:25:55] (03PS1) 10Yuvipanda: Revert "labs: Enable statistics mount on a random host to test" [puppet] - 10https://gerrit.wikimedia.org/r/229467 [19:26:06] (03CR) 10Yuvipanda: [C: 032 V: 032] Revert "labs: Enable statistics mount on a random host to test" [puppet] - 10https://gerrit.wikimedia.org/r/229467 (owner: 10Yuvipanda) [19:27:30] 6operations: copy partial dumps from dataset host to labs - https://phabricator.wikimedia.org/T108077#1512046 (10coren) That's... not an issue. 
:-) Since we moved to labstore1003, there is some 40T available for dumps (with the caveat that this lives on media that is not otherwise backed up or very redundant u... [19:27:34] YuviPanda: I think it might be, but a long time ago something changed and i can't log in as root anymore, i think it is my ssh settings [19:27:38] but i try to avoid logging in as root anyway [19:27:42] and i haven't needed to [19:27:49] so I haven't figured out why [19:28:02] ottomata: ya but labstores you can only login as root [19:28:05] YuviPanda: i moved homedirs into /tmp/badhomes/ [19:28:06] we hope to change that at some point [19:28:09] ottomata: sweet [19:28:17] haha, YuviPanda I just accidentally changed it [19:28:19] why not leave it :p [19:28:20] ? [19:28:28] ottomata: because it messes with groupids on the host [19:28:33] which are present from ldap [19:28:36] hm [19:28:37] k [19:28:44] we might have to manually cleanup etc/passwd too [19:28:57] 6operations, 6Analytics-Backlog, 6Release-Engineering, 7Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1512047 (10Krinkle) Raising priority and adding to #Performance-Team workboard. The fact that our front ends seem to be servin... [19:29:09] 6operations, 6Performance-Team, 6Release-Engineering, 7Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1512048 (10Krinkle) p:5Triage>3High [19:29:12] YuviPanda: I can do that [19:29:14] will userdel them all [19:29:31] ottomata: cool [19:31:02] ha uh, YuviPanda I can't remove me and you cause we are logged in [19:31:05] you log in as root and do it? [19:31:09] ottomata: ok! [19:31:09] lgging out... [19:31:48] !log es1.7.1: restart elastic1013 [19:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:32:08] ottomata: cool, tx :D I guess you can write up docs for the researchers and what not? 
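The homedir and /etc/passwd cleanup being worked through above can be sketched roughly as follows; the username is a placeholder and the commands are shown as comments, since the real cleanup needs root on the labstore host. Only the final check actually runs:

```shell
# Hedged sketch of the stale-account cleanup above: park the homedir first,
# then drop the passwd entry. "olduser" is a placeholder, not a real account.
#   mkdir -p /tmp/badhomes
#   mv /home/olduser /tmp/badhomes/olduser
#   userdel olduser            # fails if the user has a running session,
#                              # which is why YuviPanda had to log out first
# Verify an entry is really absent from the passwd database (getent consults
# NSS, so this also covers LDAP-backed entries):
getent passwd nosuchuser-example >/dev/null || echo "no passwd entry"
```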
[19:34:48] ja can do [19:36:56] !log es1.7.1: resume writes to indices [19:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:38:33] IndexMissingException[[mediawiki_cirrussearch_frozen_indexes] missing] 500 in 10 seconds [19:39:01] it seems so [19:39:14] just a spike [19:42:19] YuviPanda: instructions to mount the NFS mount in labs shoudl be what, edit Hiera: [19:42:20] ? [19:42:50] and set [19:42:58] mounts: [19:42:58] statistics: true [19:42:58] ? [19:43:07] oh [19:43:08] or [19:43:09] exit [19:43:10] ottomata: it's a config file in ops/puppet that a) configures the server and b) sets the mounts, I think [19:43:17] ottomata: nope [19:43:23] aye but it would be cooler if people could do it on their ownwithout comitting [19:43:24] oh [19:43:32] ottomata: it's modules/labstore/files/nfs-mounts.yaml [19:43:34] nfs_mounts: [19:43:34] statistics: true [19:43:34] ? [19:43:46] YuviPanda: https://wikitech.wikimedia.org/wiki/Hiera:Analytics [19:43:48] ottomata: nope, that was the older method that's been killed since. now enabling a mount requires a commit [19:43:49] i didn't do that, but it is there! [19:43:52] ah [19:43:54] hm. [19:44:24] ottomata: see https://gerrit.wikimedia.org/r/#/c/229466/ [19:45:32] YuviPanda: heh, the file doesn't actually say 'nfs' inside it, so it doesn't show up in github search |:( [19:45:41] k [19:46:02] YuviPanda: https://wikitech.wikimedia.org/wiki/Analytics/FAQ [19:46:06] correct? [19:47:41] (03PS2) 10Dzahn: statistics: remove blog pageviews script [puppet] - 10https://gerrit.wikimedia.org/r/229390 [19:48:42] (03CR) 10Dzahn: "ok, great! 
Thanks Tilman, merging :)" [puppet] - 10https://gerrit.wikimedia.org/r/229390 (owner: 10Dzahn) [19:49:01] (03CR) 10Dzahn: [C: 032] statistics: remove blog pageviews script [puppet] - 10https://gerrit.wikimedia.org/r/229390 (owner: 10Dzahn) [19:49:08] ottomata: looking [19:49:09] 6operations, 6Labs: audit labs versus production ssh keys - https://phabricator.wikimedia.org/T108078#1512121 (10Krenair) According to my script, both @mvolz and @jdouglas have labs keys in production. [19:49:33] ottomata: isn't the target /srv/staticstics? [19:49:49] ottomata: and I don't know if we should encourage people to have datasests with usernames in them [19:52:50] !log disabled puppet on restbase hosts in preparation for the deploy [19:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:53:02] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Assign varnish memory-only role to maps servers - https://phabricator.wikimedia.org/T105076#1512157 (10Ironholds) Questions from the data end: 1. Is this going to be integrated with the existing kafka pipelines for logging? 2. If so, what source clu... 
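The mount-enabling change referenced above (gerrit 229466) edits modules/labstore/files/nfs-mounts.yaml. The stanza below is a guess reconstructed from the conversation ("mounts: statistics: true"), not a verified copy of the file's schema:

```yaml
# Sketch of a per-project stanza in modules/labstore/files/nfs-mounts.yaml.
# Key names are inferred from the discussion and may not match the real
# schema exactly; enabling a mount now requires committing a change here
# rather than editing a Hiera: page on wikitech.
testproject:
  mounts:
    statistics: true   # expose the statistics NFS share to this project
```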
[19:53:08] (03PS4) 10Dzahn: logstash: add cluster hostnames to hiera [puppet] - 10https://gerrit.wikimedia.org/r/229203 (https://phabricator.wikimedia.org/T104964) [19:54:21] (03PS5) 10Dzahn: logstash: add cluster hostnames to hiera [puppet] - 10https://gerrit.wikimedia.org/r/229203 (https://phabricator.wikimedia.org/T104964) [19:54:57] (03PS6) 10Dzahn: logstash: add cluster hostnames to hiera [puppet] - 10https://gerrit.wikimedia.org/r/229203 (https://phabricator.wikimedia.org/T104964) [19:55:01] PROBLEM - puppet last run on praseodymium is CRITICAL Puppet last ran 1 day ago [19:55:22] !log re-enabled puppet on restbase staging cluster in preparation for deploy [19:55:23] YuviPanda: that's how we do it in databases [19:55:25] keeps things sane [19:55:29] these peopel do lots of one offs [19:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:55:30] PROBLEM - puppet last run on cerium is CRITICAL Puppet last ran 1 day ago [19:55:47] ottomata: hmm, ok. 
I edited, do revert if you want :D [19:55:50] (03CR) 10Dzahn: [C: 032] "just adding hostnames so they can be used later, not used as of now" [puppet] - 10https://gerrit.wikimedia.org/r/229203 (https://phabricator.wikimedia.org/T104964) (owner: 10Dzahn) [19:56:04] ottomata: not sure about the /srv/statisics tho [19:56:06] ah, no YuviPanda, the target is ::statistics [19:56:09] it is the rsync module name [19:56:11] PROBLEM - puppet last run on xenon is CRITICAL Puppet last ran 1 day ago [19:56:12] (03PS2) 10Yuvipanda: Update templates for RESTBase deploy [puppet] - 10https://gerrit.wikimedia.org/r/229429 (owner: 10GWicke) [19:56:18] it maps to whatever it says in the module config uhh [19:56:18] (03CR) 10Yuvipanda: [C: 032 V: 032] Update templates for RESTBase deploy [puppet] - 10https://gerrit.wikimedia.org/r/229429 (owner: 10GWicke) [19:56:25] yeah [19:56:27] /srv/statistics', [19:56:34] so in your rsync command you target the module name [19:56:35] ::statistics [19:56:36] ah ok [19:56:39] i didn't know [19:56:42] revert then :D [19:56:45] it use rsync protocol instead of ssh [19:56:49] if you do :: [19:56:51] two colons [19:56:56] grrrrreeeeabse [19:57:11] RECOVERY - puppet last run on praseodymium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:57:49] YuviPanda: ya they don't have to keep things in username based dirs/databases, but we encourage it, cause otherwise things get messy [19:57:54] people need their own workspaces [19:58:07] if there are more generic things they can put them in a diferent place [19:58:10] but by default that should be the habit [19:58:20] RECOVERY - puppet last run on xenon is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [19:58:22] we do that on the /srv partitions on the stat boxes a lot too, not always, but a lot, [19:58:23] (03PS7) 10Dzahn: logstash: add cluster hostnames to hiera [puppet] - 10https://gerrit.wikimedia.org/r/229203 (https://phabricator.wikimedia.org/T104964) 
[19:58:31] ottomata: ya fair enough. [19:58:38] its just one big disk to them, and there needs to be some way of organizing [19:58:52] (03PS8) 10Dzahn: logstash: add cluster hostnames to hiera [puppet] - 10https://gerrit.wikimedia.org/r/229203 (https://phabricator.wikimedia.org/T104964) [19:59:16] subbu: any objections against me going early? [19:59:32] no, go ahead. :) [19:59:50] RECOVERY - puppet last run on cerium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:59:54] anyway, thanks YuviPanda, when we get the ACL change i'll test the rsync and let people know. [19:59:58] ottomata: cool [20:00:00] twentyafterfour: Train ready? [20:00:04] gwicke cscott arlolra subbu: Respected human, time to deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150805T2000). Please do the needful. [20:00:55] Krinkle: it's done [20:00:59] (03CR) 10Dzahn: "i added the hostnames to hiera, you should now be able to lookup them up for this" [puppet] - 10https://gerrit.wikimedia.org/r/227960 (https://phabricator.wikimedia.org/T104964) (owner: 10Muehlenhoff) [20:01:24] OK. Got a patch to backport to wmf17 asap. [20:01:38] wanted to ride the train, but just missed it [20:02:04] * YuviPanda puts Krenair on caltrain [20:02:15] what [20:02:19] lol [20:02:28] I didn't know Caltrain goes through the UK. [20:02:37] stupid autocomplete [20:02:41] :P [20:02:41] :D [20:02:47] it still thinks krinkle isn't in this channel [20:02:52] I can autocomplete him fine on other channels [20:02:53] brr [20:02:54] is it the "baby bullet"? otherwise it's slow as hell [20:03:03] YuviPanda, stupid but entertaining! [20:03:52] gwicke: let me know when it's all good with the deploy [20:04:28] YuviPanda: I'm going to take my time [20:04:45] gwicke: yep, that's fine :) just notify me when done [20:04:53] subbu: actually, you can go first [20:04:59] ok. 
[20:05:04] not rushing, just registering a callback :) [20:05:07] I need to quickly do something else first before switching the code [20:05:18] (03PS2) 10Dzahn: Add ferm rules for Logstash/Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/227960 (https://phabricator.wikimedia.org/T104964) (owner: 10Muehlenhoff) [20:07:19] !log ori Synchronized php-1.26wmf17/extensions/PageTriage: I2089b21fc: Updated mediawiki/core Project: mediawiki/extensions/PageTriage 22eddf4ad5bf6b3fe7c49af5812ce5fcfa5e1911 (duration: 00m 14s) [20:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:08:31] PROBLEM - Outgoing network saturation on labstore1002 is CRITICAL 10.71% of data above the critical threshold [100000000.0] [20:08:32] greg-g: Ok to push a minor Wikidata update fixing two maintenance scripts? I'll coordinate with the service people or do it after they're done [20:09:41] (03PS2) 10Dzahn: Add php5-curl package to Phragile. [puppet] - 10https://gerrit.wikimedia.org/r/229355 (https://phabricator.wikimedia.org/T101235) (owner: 10Jakob) [20:10:49] (03CR) 10Dzahn: [C: 032] Add php5-curl package to Phragile. [puppet] - 10https://gerrit.wikimedia.org/r/229355 (https://phabricator.wikimedia.org/T101235) (owner: 10Jakob) [20:12:41] !log krinkle Synchronized php-1.26wmf17/includes/resourceloader/ResourceLoader.php: T104950 (duration: 00m 13s) [20:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:13:42] !log krinkle Synchronized php-1.26wmf17/includes/resourceloader/ResourceLoaderFileModule.php: T104950 (duration: 00m 12s) [20:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:18:42] twentyafterfour: Hm.. 
tin: 1.26wmf16 extensions Wikidata is git-status: dirty [20:18:46] local uncommitted patch [20:18:57] adding getStats()->increment in two files [20:19:11] PROBLEM - Outgoing network saturation on labstore1002 is CRITICAL 13.79% of data above the critical threshold [100000000.0] [20:19:16] Know anything about that? [20:19:43] hoo: ^ [20:20:22] Krinkle: Yeah, ori put it in [20:20:33] We stash it and restore if we need to update [20:20:43] oh, I think we can get rid of that now, sorry [20:21:00] it;s just git warning at me with red alert when I cd'ed into wmf16 directory on tin [20:21:25] No problem, just noticed it [20:21:56] !log krinkle Synchronized php-1.26wmf16/includes/resourceloader/ResourceLoader.php: T104950 (duration: 00m 11s) [20:22:01] (03CR) 10Dzahn: [C: 04-1] "the ports are fixed now, so that's good, but i think the hiera lookup part will fail like this" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/224095 (https://phabricator.wikimedia.org/T104962) (owner: 10Muehlenhoff) [20:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:22:15] !log krinkle Synchronized php-1.26wmf16/includes/resourceloader/ResourceLoaderFileModule.php: T104950 (duration: 00m 12s) [20:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:25:33] !log deployed parsoid version d5a5722c [20:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:30:59] 10Ops-Access-Requests, 6operations, 10Wikimedia-Mailing-lists: give John Lewis shell access on the mailman staging VM - https://phabricator.wikimedia.org/T108082#1512283 (10JohnLewis) Managed to get access (creatively) to my encrypted store. Generated ssh key is at P1837. Thanks. [20:36:39] Krinkle: Are you done deploying? 
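The stash-and-restore workflow hoo describes (parking a local uncommitted patch so the tree is clean for updates, then bringing it back) looks like this in a throwaway repository; the path and file are stand-ins, not the actual Wikidata checkout on tin:

```shell
# Minimal stash/restore cycle in a scratch repo.
rm -rf /tmp/stash-demo && git init -q /tmp/stash-demo && cd /tmp/stash-demo
git config user.email demo@example.org
git config user.name demo
git commit -q --allow-empty -m init
echo "local instrumentation patch" > hack.txt   # stand-in for the getStats() edit
git add hack.txt
git stash -q       # park the patch; the working tree is now clean for the update
git stash pop -q   # ...update happens in between... then restore the patch
cat hack.txt
```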
[20:36:59] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1512287 (10RobH) a:5RobH>3akosiaris I'll note that upon further reflection, I'm not entirely comfortable circumventing Alex in this process. When I setup the process with him awhile ago, all of these we... [20:37:26] hoo: Not anymore. [20:38:00] So you're not deploying anymore or not done anymore? :D [20:39:35] gwicke: are you going to do the deploy at all? Do keep me updated so I know I'm not going to have to just sit here for an hour and not walk around. [20:40:01] (03PS5) 10Smalyshev: Fix rules.log error when starting Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/229150 [20:41:01] hoo: I'm done. [20:41:06] Ah nice [20:41:11] will push my update then [20:43:05] YuviPanda: yes, I will; just in a meeting & didn't manage to finish it before [20:43:19] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 5Patch-For-Review: Build new latest stable (0.8.2.1?) Kafka package and upgrade Kafka brokers - https://phabricator.wikimedia.org/T106581#1512300 (10Ottomata) Phew, ok, Joseph and I tested this migration again in labs: https://etherpad.wikimedia.org/p/ka... [20:44:04] gwicke: alright. I'll have to go soon, so I do hope you finish up before 2 and not disappear again :) [20:45:37] 6operations, 6Labs: audit labs versus production ssh keys - https://phabricator.wikimedia.org/T108078#1512302 (10Mvolz) Whoops. You can revoke that key, I haven't needed production access in a while. https://gerrit.wikimedia.org/r/190405 was the change. [20:46:16] 10Ops-Access-Requests, 6operations, 10Wikimedia-Mailing-lists: give John Lewis shell access on the mailman staging VM - https://phabricator.wikimedia.org/T108082#1512303 (10Dzahn) This means sudo ALL ALL on the staging VM for the testing phase. 
[20:49:10] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: /page/title/{title} is CRITICAL: Test Get rev of by title from MW returned the unexpected status 500 (expecting: 200) [20:49:17] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1512312 (10RobH) Note: While the testing instance can use a private IP, the eventual live instance will need a public IP. Since Alex will have to review this task anyone, perhaps he can advise. When lookin... [20:50:16] Krenair: dude... you fucking rock [20:50:35] thx for checking the keys! [20:50:54] can you push your script up to the ticket so future checks are easy? =] [20:51:03] or is it already in git? [20:51:22] robh we can even make it run via jenkins :) [20:51:27] indeed [20:51:40] cuz i stupidly didnt check one last week and pushed it [20:51:53] then realized it only cuz i had to techsupport ssh config =] [20:52:02] robh, haha, okay [20:52:03] gwicke: is the restbase alert you? please log your actions and not leave me hanging. Thanks. [20:52:06] it's sort of partially in git [20:52:14] it's an existing script with other stuff hacked in/out [20:52:17] I'm updating the ops side directions to stress checking it, but having an actual check as YuviPanda suggests is even better. [20:53:21] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy [20:54:30] 6operations, 6Discovery, 10Traffic, 10Wikidata, and 2 others: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1512318 (10csteipp) @Smalyshev, before we deploy this, can we task someone with updating $wgCrossSiteAJAXdomains to remove it from CORS domains,... 
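Krenair's check (flagging accounts whose labs SSH key also appears in production) can be approximated as a sorted-set intersection. The key material and file names below are fabricated stand-ins; the real script reads labs LDAP and the puppet admin module:

```shell
# Toy version of the labs-vs-production key audit: any line common to both
# sorted lists is a key reused across the two environments.
mkdir -p /tmp/keyaudit
printf '%s\n' \
  'ssh-rsa AAAAexampleKey1 user1' \
  'ssh-rsa AAAAexampleKey2 user2' | sort > /tmp/keyaudit/labs.keys
printf '%s\n' \
  'ssh-rsa AAAAexampleKey2 user2' \
  'ssh-rsa AAAAexampleKey3 user3' | sort > /tmp/keyaudit/prod.keys
comm -12 /tmp/keyaudit/labs.keys /tmp/keyaudit/prod.keys   # keys in BOTH lists
```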
[20:55:33] (03PS1) 10Dzahn: admin: deactivate shell for user mvolz [puppet] - 10https://gerrit.wikimedia.org/r/229562 [20:55:38] YuviPanda: back now [20:55:53] 6operations, 6Discovery, 10Traffic, 10Wikidata, and 2 others: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1512322 (10Smalyshev) @csteipp sure. but I have no idea who that would be. Could you create a task and assign it to appropriate person? [20:55:55] !log hoo Synchronized php-1.26wmf16/extensions/Wikidata/: Update Wikibase: Fix the dumpJson and the rebuildItemsPerSite maintenance scripts (duration: 00m 20s) [20:55:59] gwicke: so that restbase alert wasn't you? [20:56:00] 6operations, 10Wikimedia-Mailing-lists: Rename usergroups@ to usergroup-applications@ - https://phabricator.wikimedia.org/T108099#1512323 (10JohnLewis) @robh another list rename. [20:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:56:06] 6operations: revoke mvolz production access - https://phabricator.wikimedia.org/T108100#1512328 (10RobH) 3NEW a:3RobH [20:56:07] robh: ^^ [20:56:09] gwicke: you're almost at the end of your window. [20:56:24] !log hoo Synchronized php-1.26wmf17/extensions/Wikidata/: Update Wikibase: Fix the dumpJson and the rebuildItemsPerSite maintenance scripts (duration: 00m 20s) [20:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:56:31] heh i didnt do the other one yet either... i need to schedule that... lets go for next tuesday for all of them. [20:56:39] tomorrow seems to short a notice [20:56:44] (03PS2) 10Dzahn: admin: deactivate shell for user mvolz [puppet] - 10https://gerrit.wikimedia.org/r/229562 (https://phabricator.wikimedia.org/T108078) [20:56:45] and friday maint is ill fated. [20:56:53] gwicke: I have to leave for the office soon. Do you think this will be done in 5-10 mins and want me to hang around, or should I revert and reschedule? 
[20:57:15] heh, conflict, i'll step back [20:57:17] robh: give me a time and I can be around if you want me to as well :) [20:57:40] YuviPanda: it's okay; if everything goes south we can still keep puppet disabled until we can merge a revert [20:57:40] All the other ones have been at 10AM Pacific on Tuesday [20:57:41] YuviPanda: I'll be around from now forward [20:57:47] gwicke: ok [20:57:48] We could move it to 9AM [20:57:49] (03PS3) 10Dzahn: admin: deactivate shell for user mvolz [puppet] - 10https://gerrit.wikimedia.org/r/229562 (https://phabricator.wikimedia.org/T108100) [20:57:57] but otherwise why mess with tradition =] [20:58:03] its mailman tuesday! [20:58:12] followed by tacos! [20:58:12] (03PS1) 10BryanDavis: beta: Configure $wgStatsdServer and $wgStatsdMetricPrefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229565 (https://phabricator.wikimedia.org/T108091) [20:58:30] robh: I'll see what I can do for availability :) [20:58:55] 6operations, 5Patch-For-Review: revoke mvolz production access - https://phabricator.wikimedia.org/T108100#1512353 (10RobH) [20:59:00] wait, next week is your vacation no? [20:59:08] !log restbase 9e177f3 (deploy 7006f9f) canary deploy on restbase1001 [20:59:11] I'll be backign up a ton of stuff, and we dont have any odd things floating around [20:59:14] you should be ok to vacation! [20:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:59:23] gwicke: if you need something let me know [20:59:26] but i never turn down folks being on standby [21:00:11] gwicke: ok. I'll make my way to the office now. in the future, please be more communicative if someone offers to help volunteer to do a deploy for you. Thanks. [21:00:16] robh: if you say so - okay. If I can, I'll be around. 
But I won't jump hoops [21:00:34] (03CR) 10Ori.livneh: [C: 031] beta: Configure $wgStatsdServer and $wgStatsdMetricPrefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229565 (https://phabricator.wikimedia.org/T108091) (owner: 10BryanDavis) [21:00:37] YuviPanda: sorry about keeping you hanging [21:00:51] I'll try to be clearer about what I need next time [21:00:59] ok [21:01:13] JohnFLewis: well, the one horrible ting was when an unrelated changeset snuck in [21:01:18] and then we had to figure out what happened [21:01:33] this time i restart and apply all puppet stuff before i start, so at least we know where breaks originate [21:01:48] plus we've done enough of these recently that its not like mailman has run for months without interruption [21:02:02] robh: true true [21:02:10] (all of these seem like famous last words, if I thought the world wasn't dictated by logic and not fate) [21:02:41] you are doing quite enough heavy lifting in the migration as it is ;D [21:02:45] robh: note after user groups is renamed - there is a list creation request gor usergroups [21:03:50] Which is where that request comes from so usergroups should _NOT_ forward to the renamed list. [21:05:14] that took about 3 rereads to parse correctly. [21:05:25] rename on elist for one function [21:05:28] 6operations, 10Wikimedia-Mailing-lists: Rename usergroups@ to usergroup-applications@ - https://phabricator.wikimedia.org/T108099#1512368 (10JohnLewis) T99443 Should be done after (correct me if wrong). Thus the list should be renamed only without the Apache and exim Magic. [21:05:32] a new list will be created on old list name for new function. [21:05:46] so move all archives, leave no redireciotn. 
[21:05:47] !log es1.7.1 upgrade for es1014 [21:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:05:57] robh: yeah [21:06:01] cool [21:06:25] (03CR) 10Gergő Tisza: beta: Configure $wgStatsdServer and $wgStatsdMetricPrefix (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229565 (https://phabricator.wikimedia.org/T108091) (owner: 10BryanDavis) [21:06:39] Then either create the list or assign the creation task to me. Either works, the former is better :) [21:07:33] 6operations, 6Labs, 5Patch-For-Review: audit labs versus production ssh keys - https://phabricator.wikimedia.org/T108078#1512371 (10Krenair) Per request, here's the script. Stick it in modules/ldap/files/scripts (operations/puppet clone) on a machine which is connected to labs LDAP, and run it from that dire... [21:08:05] (03CR) 10Rush: "I was expecting manifest/role/labsphragile.pp" [puppet] - 10https://gerrit.wikimedia.org/r/227466 (https://phabricator.wikimedia.org/T101235) (owner: 10WMDE-leszek) [21:09:34] (03PS1) 10RobH: revoke mvolz production access [puppet] - 10https://gerrit.wikimedia.org/r/229566 [21:10:46] robh, mutante what about ensure => absent? [21:10:58] ? [21:11:22] Sorry, what do you mean? My patchset sets the user to absent [21:11:42] (03CR) 10RobH: [C: 032] revoke mvolz production access [puppet] - 10https://gerrit.wikimedia.org/r/229566 (owner: 10RobH) [21:11:45] mine doesn't, and just sets the key to [] [21:11:54] well, this person doesnt need access [21:12:01] if it was just a wrong key, then it would be fine [21:12:06] oh, you both tried to remove the user [21:12:07] mutante: did you make the same patch? [21:12:14] the ticket is assigned to me... [21:12:14] https://gerrit.wikimedia.org/r/#/c/229562/ [21:12:25] mutante used the right syntax to trigger the gerrit bot :p [21:12:29] !log Started dumpwikidatajson.sh on snapshoot1003 to create a Wikidata json dump after earlier attempts this week failed. [21:12:31] ... 
[21:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:12:37] mutante: dude, stop poaching my ticket [21:12:39] =p [21:12:45] my patch was earlier :p [21:12:48] i'm assigned to it ;] [21:12:53] check timestamp [21:12:57] also: [21:12:57] 7Blocked-on-Operations, 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Add Redis to maps cluster - https://phabricator.wikimedia.org/T107813#1512376 (10Yurik) [21:12:58] I mad ethe task. [21:13:04] 13:58 < mutante> heh, conflict, i'll step back [21:13:05] what part is unclear, whatever, just ake it. [21:13:07] mutante: focus on another access requests :p [21:13:17] i dont want to argue about this, i just want to do work. [21:13:37] I just thought claiming the task I created was a clear way to show I was making the patch. [21:13:39] then dont and go ahead, i already said i'll step back over 15 minutes ago [21:13:49] snapshoot [21:14:03] (03Abandoned) 10Dzahn: admin: deactivate shell for user mvolz [puppet] - 10https://gerrit.wikimedia.org/r/229562 (https://phabricator.wikimedia.org/T108100) (owner: 10Dzahn) [21:14:16] ff only makes this painful already... [21:14:19] ;P [21:15:31] robh: if I had my shell, I would have made a patch too ;) [21:15:32] 6operations, 5Patch-For-Review: revoke mvolz production access - https://phabricator.wikimedia.org/T108100#1512382 (10RobH) multiple users doing the same work, same end result. 
access from production is now removed [21:15:36] Reedy: :D [21:15:39] 6operations, 6Labs, 5Patch-For-Review: audit labs versus production ssh keys - https://phabricator.wikimedia.org/T108078#1512384 (10RobH) [21:15:41] 6operations, 5Patch-For-Review: revoke mvolz production access - https://phabricator.wikimedia.org/T108100#1512383 (10RobH) 5Open>3Resolved [21:16:31] looks like someone else tried to CC me on a task I couldn't view [21:16:31] mutante: are you revoking jdouglas, if so i'll assign the task I'm creating to you [21:16:40] cuz he'll need to give us a new one [21:16:50] but i dont think we should wait on him, i think we should remove him immediately. [21:17:02] (just his key) [21:17:12] he has deployment rights [21:17:19] !log finished deploy of restbase 9e177f3 (deploy 7006f9f) on restbase cluster [21:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:17:51] robh: no, i'm not, i'm focusing on access for John now [21:18:00] ok, i'll do it then [21:18:08] im revoking immediately. [21:18:20] https://phabricator.wikimedia.org/T88464 was the ticket granting access [21:18:51] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [21:19:01] Krenair: ... no one reads that document! [21:19:09] everyone signs and it clearly says dont use same keys ;_; [21:19:19] 1007 is me, should be back in a moment [21:19:41] I read that document :/ [21:19:51] well, perhaps 'everyone' is a gross overreaction. 
[21:19:58] s/perhaps/is/ [21:20:10] meh, drop is, poor grammar [21:20:12] Although I think I read the version on wikitech, and didn't sign anything in phab itself [21:20:13] :P [21:20:27] Krenair: if you ask for any kind of productoin shell change you'll be asked [21:20:30] its part of the process [21:20:49] we aren't requiring existing folks to sign, unless we modify their access. [21:21:00] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy [21:21:03] (modify, removing we'll just remove ;) [21:21:40] Yeah, I have no real plans to request a modification right now anyway [21:24:14] 6operations: Replace jdouglas's production ssh key - it matched labs key - https://phabricator.wikimedia.org/T108111#1512463 (10RobH) 3NEW a:3RobH [21:24:17] (03PS1) 10Yuvipanda: k8s: Initial terrible k8s module [puppet] - 10https://gerrit.wikimedia.org/r/229568 [21:25:03] (03CR) 10jenkins-bot: [V: 04-1] k8s: Initial terrible k8s module [puppet] - 10https://gerrit.wikimedia.org/r/229568 (owner: 10Yuvipanda) [21:25:55] (03PS2) 10Yuvipanda: k8s: Initial terrible k8s module [puppet] - 10https://gerrit.wikimedia.org/r/229568 [21:26:27] (03PS1) 10RobH: revoke jdouglas ssh key [puppet] - 10https://gerrit.wikimedia.org/r/229570 [21:26:52] (03CR) 10RobH: [C: 032] revoke jdouglas ssh key [puppet] - 10https://gerrit.wikimedia.org/r/229570 (owner: 10RobH) [21:27:17] * robh refreshes patchset for verified over and over ;D [21:27:20] (03PS3) 10Yuvipanda: k8s: Initial terrible k8s module [puppet] - 10https://gerrit.wikimedia.org/r/229568 [21:27:43] robh :D I'll wait to not conflict you [21:27:58] (03PS4) 10Yuvipanda: k8s: Initial terrible k8s module [puppet] - 10https://gerrit.wikimedia.org/r/229568 [21:28:03] heh, how did you know i was totally trolling you? 
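The two revocation styles that collided above (deactivating the whole user versus only emptying the key list) can be sketched in generic Puppet. The WMF admin module actually drives this from YAML data, so these resources illustrate the idea rather than reproduce either patch:

```puppet
# Style 1: remove the account entirely (the "ensure => absent" approach).
user { 'exampleuser':
  ensure => absent,
}

# Style 2: keep the account but revoke access by removing its key
# (the "just sets the key to []" approach, used when a replacement
# key is expected). "exampleuser" is a placeholder throughout.
ssh_authorized_key { 'exampleuser-key':
  ensure => absent,
  user   => 'exampleuser',
  type   => 'ssh-rsa',
}
```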
[21:28:05] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Initial terrible k8s module [puppet] - 10https://gerrit.wikimedia.org/r/229568 (owner: 10Yuvipanda) [21:28:22] robh puppet merged yours too [21:28:30] i was about to ask if you were arleady on there [21:28:31] thx [21:30:58] 6operations, 10Wikimedia-Mailing-lists: Rename usergroups@ to usergroup-applications@ - https://phabricator.wikimedia.org/T108099#1512491 (10Krenair) [21:31:28] pushing a parsoid hotfix to deal with a crasher. [21:31:29] 6operations: Replace jdouglas's production ssh key - it matched labs key - https://phabricator.wikimedia.org/T108111#1512493 (10RobH) a:5RobH>3Jdouglas Assigning this to @jdouglas so he can update with his ssh key. Additionally, I'm adding this to ops-access-requests, just so it gets triaged and his new key... [21:31:40] 10Ops-Access-Requests, 6operations: Replace jdouglas's production ssh key - it matched labs key - https://phabricator.wikimedia.org/T108111#1512495 (10RobH) [21:34:06] 6operations, 10Wikimedia-Mailing-lists: Rename usergroups@ to usergroup-applications@ - https://phabricator.wikimedia.org/T108099#1512511 (10Krenair) > We also need to simultaneously update the contact form on Meta for user group applications, which sends mail to this list (see T95789). It just sends to the e... [21:34:51] (03PS1) 10GWicke: Set static metric / log names [puppet] - 10https://gerrit.wikimedia.org/r/229571 [21:34:53] (03PS1) 10Yuvipanda: k8s: Do not specify cafile explicitly for flannel [puppet] - 10https://gerrit.wikimedia.org/r/229572 [21:34:59] (03CR) 10jenkins-bot: [V: 04-1] k8s: Do not specify cafile explicitly for flannel [puppet] - 10https://gerrit.wikimedia.org/r/229572 (owner: 10Yuvipanda) [21:35:17] (03PS2) 10Yuvipanda: k8s: Do not specify cafile explicitly for flannel [puppet] - 10https://gerrit.wikimedia.org/r/229572 [21:35:29] chasemp: https://gerrit.wikimedia.org/r/#/c/229571/ is a small follow-up fix from the deploy [21:35:52] gwicke: you ready for merge now? 
[21:35:59] chasemp: yes
[21:36:03] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Do not specify cafile explicitly for flannel [puppet] - 10https://gerrit.wikimedia.org/r/229572 (owner: 10Yuvipanda)
[21:36:32] (03CR) 10Rush: [C: 032] Set static metric / log names [puppet] - 10https://gerrit.wikimedia.org/r/229571 (owner: 10GWicke)
[21:36:52] (03PS2) 10Rush: Set static metric / log names [puppet] - 10https://gerrit.wikimedia.org/r/229571 (owner: 10GWicke)
[21:36:56] 6operations, 10Wikimedia-Mailing-lists: Rename usergroups@ to usergroup-applications@ - https://phabricator.wikimedia.org/T108099#1512532 (10Slaporte) >>! In T108099#1512511, @Krenair wrote: >> We also need to simultaneously update the contact form on Meta for user group applications, which sends mail to this...
[21:37:11] chasemp: thanks!
[21:37:17] gwicke: I'm not sure it's going to let me merge as is
[21:37:18] Please rebase the change locally and upload again for review.
[21:37:22] I tried ui rebase
[21:38:17] chasemp I think you just conflicted with me. your rebase finished fine
[21:39:31] odd error message
[21:39:57] yes
[21:40:02] it's definitely rebased in gerrit
[21:40:45] it's merged
[21:40:48] I had to +2 again and do a dance
[21:40:51] should be cool
[21:40:58] ;)
[21:42:13] (03PS1) 10Yuvipanda: k8s: Don't use quotes when passing params to flannel [puppet] - 10https://gerrit.wikimedia.org/r/229575
[21:42:18] (03CR) 10jenkins-bot: [V: 04-1] k8s: Don't use quotes when passing params to flannel [puppet] - 10https://gerrit.wikimedia.org/r/229575 (owner: 10Yuvipanda)
[21:42:26] (03PS2) 10Yuvipanda: k8s: Don't use quotes when passing params to flannel [puppet] - 10https://gerrit.wikimedia.org/r/229575
[21:42:35] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Don't use quotes when passing params to flannel [puppet] - 10https://gerrit.wikimedia.org/r/229575 (owner: 10Yuvipanda)
[21:44:05] !log deployed cherry-picked ba49b80bdc3a156604eb3996830af0d5bc45c503 hotfix to the parsoid cluster to deal with crashers from deploy earlier today
[21:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:53:46] (03PS1) 10Yuvipanda: k8s: Hook up docker to flannel [puppet] - 10https://gerrit.wikimedia.org/r/229579
[21:53:51] (03CR) 10jenkins-bot: [V: 04-1] k8s: Hook up docker to flannel [puppet] - 10https://gerrit.wikimedia.org/r/229579 (owner: 10Yuvipanda)
[21:53:58] (03PS2) 10Yuvipanda: k8s: Hook up docker to flannel [puppet] - 10https://gerrit.wikimedia.org/r/229579
[21:54:49] (03CR) 10Yuvipanda: [C: 032] k8s: Hook up docker to flannel [puppet] - 10https://gerrit.wikimedia.org/r/229579 (owner: 10Yuvipanda)
[21:55:58] 10Ops-Access-Requests, 6operations, 3Discovery-Wikidata-Query-Service-Sprint: Need sudo to blazegraph on wdqs1001/1002 - https://phabricator.wikimedia.org/T107819#1512662 (10Smalyshev)
[21:57:15] (03PS1) 10Yuvipanda: k8s: Explicitly declare the docker service [puppet] - 10https://gerrit.wikimedia.org/r/229581
[21:57:20] (03CR) 10jenkins-bot: [V: 04-1] k8s: Explicitly declare the docker service [puppet] - 10https://gerrit.wikimedia.org/r/229581 (owner: 10Yuvipanda)
[21:57:27] (03PS2) 10Yuvipanda: k8s: Explicitly declare the docker service [puppet] - 10https://gerrit.wikimedia.org/r/229581
[21:57:34] is trying to make git fetch happen
[21:57:50] yeah :)
[21:57:51] I just did one
[21:58:04] but I'll have to do one right after merging that
[21:58:29] (03CR) 10Yuvipanda: [C: 032] k8s: Explicitly declare the docker service [puppet] - 10https://gerrit.wikimedia.org/r/229581 (owner: 10Yuvipanda)
[22:03:06] (03PS1) 10Smalyshev: T103907: restrict further passed URLs [puppet] - 10https://gerrit.wikimedia.org/r/229584
[22:10:11] (03PS1) 10Dzahn: admin: add mailman root group and add john [puppet] - 10https://gerrit.wikimedia.org/r/229585 (https://phabricator.wikimedia.org/T108082)
[22:10:56] (03CR) 10jenkins-bot: [V: 04-1] admin: add mailman root group and add john [puppet] - 10https://gerrit.wikimedia.org/r/229585 (https://phabricator.wikimedia.org/T108082) (owner: 10Dzahn)
[22:11:37] robh, still around?
[22:11:53] yep
[22:12:12] RECOVERY - Outgoing network saturation on labstore1002 is OK Less than 10.00% above the threshold [75000000.0]
[22:12:20] do you ever touch ldap groups?
[22:12:44] nope, but i followed along on the last nda
[22:12:47] so i wouldnt mind doing one.
[22:12:58] ive looked them up
[22:13:00] but not appended
[22:13:03] 6operations, 6Release-Engineering, 6Zero, 7Mobile, 7Technical-Debt: Pull WikipediaMobileFirefoxOS from mediawiki-config - https://phabricator.wikimedia.org/T107172#1512747 (10jhobs) I was on vacation last week and am thus just getting to this now. I have a meeting with the Zero team tomorrow and I'll bri...
[22:13:18] I made a list of 6 people who probably shouldn't be there
[22:13:22] anymore
[22:13:48] add to a phab task and make it other under security?
[22:14:01] ok
[22:14:06] no reason to publish that they are still there until AFTER we fix
[22:14:13] i'm fine with removing the private flags post implementation.
[22:14:17] (03PS1) 10Dzahn: admin: add user for John F. Lewis [puppet] - 10https://gerrit.wikimedia.org/r/229587 (https://phabricator.wikimedia.org/T108082)
[22:14:47] but yea, im happy to handle the removals. I haven't done it yet, but that seems like more reason to do so.
[22:14:49] =]
[22:15:01] JohnFLewis is a puppet? :O
[22:15:23] pretty sure its more like daniel and i are john's ops puppets.... ;D
[22:15:46] haha
[22:16:26] mailman migration is progressing \o/
[22:16:29] (03CR) 10Dzahn: "this is after: https://gerrit.wikimedia.org/r/229587" [puppet] - 10https://gerrit.wikimedia.org/r/229585 (https://phabricator.wikimedia.org/T108082) (owner: 10Dzahn)
[22:16:33] robh, the list of users in the group is accessible to anyone who can get into labs
[22:16:40] oh
[22:16:40] but sure
[22:16:44] then no need to make private.
[22:16:55] see how much I know about ldap?
[22:17:01] :)
[22:17:44] (03CR) 10John F. Lewis: [C: 031] "Yeah - this is me, my key, my name and my uid." [puppet] - 10https://gerrit.wikimedia.org/r/229587 (https://phabricator.wikimedia.org/T108082) (owner: 10Dzahn)
[22:19:01] (03PS3) 10Dzahn: Add query.wikidata.org [dns] - 10https://gerrit.wikimedia.org/r/228411 (https://phabricator.wikimedia.org/T107602) (owner: 10JanZerebecki)
[22:23:08] (03CR) 10Dzahn: [C: 032] Add query.wikidata.org [dns] - 10https://gerrit.wikimedia.org/r/228411 (https://phabricator.wikimedia.org/T107602) (owner: 10JanZerebecki)
[22:27:05] !log es1.7.1 upgrade on elastic1015
[22:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:27:13] !log hoo Synchronized php-1.26wmf16/extensions/Wikidata/: Update Wikibase: Fix use class in CallbackFactory (duration: 00m 20s)
[22:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:27:43] !log hoo Synchronized php-1.26wmf17/extensions/Wikidata/: Update Wikibase: Fix use class in CallbackFactory (duration: 00m 21s)
[22:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:29:31] !log Started dumpwikidatajson.sh on snapshot1003 again to create a Wikidata json dump after earlier attempts this week and today failed.
[22:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:31:12] (03CR) 10Catrope: [C: 031] Disable Special:NewMessages on wiki with LiquidThreads frozen [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229460 (https://phabricator.wikimedia.org/T107898) (owner: 10Sbisson)
[22:33:26] (03CR) 10BryanDavis: beta: Configure $wgStatsdServer and $wgStatsdMetricPrefix (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229565 (https://phabricator.wikimedia.org/T108091) (owner: 10BryanDavis)
[22:35:34] 6operations, 6Release-Engineering, 6Zero, 7Mobile, 7Technical-Debt: Pull WikipediaMobileFirefoxOS from mediawiki-config - https://phabricator.wikimedia.org/T107172#1512863 (10MaxSem) >>! In T107172#1489893, @demon wrote: >> is that actually receiving traffic? > > Good question. I hope not. ``` hive (d...
[22:36:00] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1422 bytes in 0.209 second response time
[22:36:09] hoo: ? ^
[22:36:40] jzerebecki:
[22:36:51] *sigh*
[22:38:12] seems we can't keep up with https://www.wikidata.org/wiki/User:KrBot ...
[22:38:30] looking
[22:39:32] Warning: array_key_exists() expects parameter 2 to be array, null given in /srv/mediawiki/php-1.26wmf17/extensions/Wikidata/extensions/Wikibase/lib/includes/changes/DiffChange.php on line 95
[22:39:33] uh
[22:40:24] woah
[22:40:28] (03PS3) 10Dzahn: wikidata query: add misc-web configuration [puppet] - 10https://gerrit.wikimedia.org/r/229392 (https://phabricator.wikimedia.org/T107602) (owner: 10Giuseppe Lavagetto)
[22:40:35] Krenair: Look into hhvm log on fluorine
[22:40:51] umm
[22:40:53] ok
[22:40:58] (03CR) 10Dzahn: "PS3 only fixed the whitespace" [puppet] - 10https://gerrit.wikimedia.org/r/229392 (https://phabricator.wikimedia.org/T107602) (owner: 10Giuseppe Lavagetto)
[22:41:01] Sorry, meant Krinkle
[22:41:05] woah
[22:41:06] Damn autocomplete
[22:41:32] message repeated 853 times: [ #012Warning: array_map() expects parameter 1 to be a valid callback in /srv/mediawiki/php-1.26wmf17/includes/resourceloader/ResourceLoaderFileModule.php on line 580
[22:42:35] heh
[22:42:44] i dunno who updated clinic duty page with ldap info, but thanks!
[22:42:47] likely was mutante.
[22:43:11] jzerebecki: Will open a ticket about the warnings
[22:43:17] hoo: ?
[22:43:25] or andrewbogott, he does a bunch of ldap stuff
[22:43:30] The Wikibase ones
[22:43:33] either way, huzzah for good documentation.
[22:43:51] Krinkle: Weren't you working on these earlier today?
[22:44:14] hoo: Yeah, that warning suggests safeFileHash doesn't exist
[22:44:44] But I synced ResourceLoader.php before ResourceLoaderFileModule.php.php
[22:44:46] which has that method
[22:44:57] hoo: I don't find yours in logstash
[22:44:59] hoo: Any particular wiki or mw server? Happening recently?
[22:45:17] Krenair: all of the appservers it looks
[22:45:18] jzerebecki: Yeah, it's a warning only
[22:45:24] and only happening on Zend servers
[22:45:36] Krinkle: Flooding right now
[22:45:50] only on zend? wait, this is hhvm.log
[22:45:53] A lot of servers
[22:46:02] hoo: OK. checking
[22:46:04] mutante: We're talking about two problems here
[22:46:11] One in Wikibase and one in ResourceLoader
[22:46:18] hoo: ah,ok
[22:47:01] woah. logstash says that RL one has happened ~250K times in the last hour
[22:47:01] !log krinkle Synchronized php-1.26wmf16/includes/resourceloader/ResourceLoaderModule.php: T104950 (duration: 00m 13s)
[22:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:47:19] !log krinkle Synchronized php-1.26wmf17/includes/resourceloader/ResourceLoaderModule.php: T104950 (duration: 00m 12s)
[22:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:47:29] That should fix it
[22:47:33] I hate per- file syncing
[22:47:36] it's easy to sync the wrong file
[22:47:48] I synced ResourceLoader.php but the method is in ResourceLoaderModule.php
[22:47:52] Yeah, looks good now
[22:47:56] It failed gracefully though
[22:47:59] no user impact
[22:48:01] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1409 bytes in 0.114 second response time
[22:48:16] Thanks hoo
[22:48:41] It is indeed easy to mess up file syncs. I've thought of trying to automate that (sync-hash GIT_HASH) but never written the code
[22:49:39] i like how the monitoring worked and told us
[22:49:43] we just adjusted that the other day
[22:50:06] mutante: Indeed :)
[22:52:56] (03CR) 10Dzahn: [C: 031] wikidata query: add misc-web configuration [puppet] - 10https://gerrit.wikimedia.org/r/229392 (https://phabricator.wikimedia.org/T107602) (owner: 10Giuseppe Lavagetto)
[22:54:42] (03CR) 10Gergő Tisza: beta: Configure $wgStatsdServer and $wgStatsdMetricPrefix (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229565 (https://phabricator.wikimedia.org/T108091) (owner: 10BryanDavis)
[22:54:57] (03CR) 10Dzahn: "merged the DNS change - query.wikidata.org points to misc-web" [puppet] - 10https://gerrit.wikimedia.org/r/229392 (https://phabricator.wikimedia.org/T107602) (owner: 10Giuseppe Lavagetto)
[22:58:11] (03PS2) 10BryanDavis: beta: Configure $wgStatsdServer and $wgStatsdMetricPrefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229565 (https://phabricator.wikimedia.org/T108091)
[22:58:26] (03CR) 10BryanDavis: beta: Configure $wgStatsdServer and $wgStatsdMetricPrefix (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229565 (https://phabricator.wikimedia.org/T108091) (owner: 10BryanDavis)
[23:00:04] RoanKattouw ostriches rmoen Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150805T2300). Please do the needful.
[23:00:04] James_F bd808: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:10] * James_F waves.
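(In the exchange above, Krinkle notes that per-file syncing makes it easy to sync the wrong file, and bd808 floats automating it as `sync-hash GIT_HASH` but says the code was never written. A minimal sketch of that idea, under stated assumptions: `files_to_sync` is a hypothetical helper, not an existing tool, and in practice its input would come from `git diff-tree --no-commit-id --name-only -r HASH`, with the real `sync-file` invocation left out.)

```python
# Hypothetical sketch of the "sync-hash GIT_HASH" idea: derive the files to
# sync from the commit itself rather than typing paths by hand. The parsing
# step is shown as a pure function over `git diff-tree --name-only` output.

def files_to_sync(diff_tree_output: str) -> list[str]:
    """Return the file paths listed in `git diff-tree --name-only` output."""
    return [line.strip() for line in diff_tree_output.splitlines() if line.strip()]

# The two files confused in the incident above:
sample = (
    "includes/resourceloader/ResourceLoaderModule.php\n"
    "includes/resourceloader/ResourceLoaderFileModule.php\n"
)
for path in files_to_sync(sample):
    print("would sync:", path)  # a real tool would invoke sync-file here
```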
[23:00:32] !log es1.7.1 upgrade on elastic1016
[23:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:00:40] o/
[23:01:24] !log ori Synchronized php-1.26wmf17/extensions/EducationProgram: I2089b21fc (duration: 00m 13s)
[23:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:01:37] is gerrit back to doing submodule bumps on merge again?
[23:02:09] 6operations: Create an offboarding workflow with IT & Operations - https://phabricator.wikimedia.org/T108131#1512945 (10RobH) a:3RobH
[23:02:39] bd808: Not for VE. But otherwise yes.
[23:02:42] bd808: I think.
[23:03:03] not for VE but your commit is a submodule bump for VE?
[23:03:28] 6operations: Create an offboarding workflow with IT & Operations - https://phabricator.wikimedia.org/T108131#1512967 (10RobH) p:5Triage>3Normal I'll be chatting with @Jkrauska abou this tomorrow.
[23:03:45] 6operations: Create an offboarding workflow with IT & Operations - https://phabricator.wikimedia.org/T108131#1512971 (10Dzahn) fwiw, we need to solve the same problem for on-boarding too, people who get hired regularly lack the WMF LDAP group and we'd need some kind of notification here
[23:05:06] 6operations, 10Traffic, 10Wikimedia-DNS: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1512972 (10CCogdill_WMF) > The fact that T107977 (local creation/routing of anna@benefactorevents.wikimedia.org) was created leads me to think that Tri...
[23:05:21] I can do the swat today if nobody else is rushing to do so
[23:05:55] I'll do my dumb little config change first and then sort out James_F's submodule bump
[23:06:08] Ta.
[23:06:43] ori: you all done on tin?
[23:07:19] yes
[23:07:32] cool beans
[23:07:42] (03CR) 10BryanDavis: [C: 032] beta: Configure $wgStatsdServer and $wgStatsdMetricPrefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229565 (https://phabricator.wikimedia.org/T108091) (owner: 10BryanDavis)
[23:07:48] (03Merged) 10jenkins-bot: beta: Configure $wgStatsdServer and $wgStatsdMetricPrefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229565 (https://phabricator.wikimedia.org/T108091) (owner: 10BryanDavis)
[23:07:49] 6operations: Create an offboarding workflow with IT & Operations - https://phabricator.wikimedia.org/T108131#1512978 (10Krenair) >>! In T108131#1512971, @Dzahn wrote: > fwiw, we need to solve the same problem for on-boarding too, people who get hired regularly lack the WMF LDAP group and we'd need some kind of n...
[23:09:02] !log bd808 Synchronized wmf-config/CommonSettings.php: beta: Configure and (I7d20abb) (duration: 00m 13s)
[23:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:09:16] heh. needed single quotes
[23:12:02] 6operations, 6Collaboration-Team Backlog, 10Collaboration-Team-Current, 10Flow: Setup separate logical External Store for Flow - https://phabricator.wikimedia.org/T107610#1512993 (10Mattflaschen) p:5Unbreak!>3High
[23:12:17] tgr: we has stats at https://graphite.wmflabs.org/dashboard/
[23:12:35] ok, time to find out if gerrit will do all the hard work for me
[23:13:23] 6operations: Create an offboarding workflow with IT & Operations - https://phabricator.wikimedia.org/T108131#1512999 (10RobH) Discussion about this with @dzahn resulted in some nice ideas. Namely, we can populate the https://wikitech.wikimedia.org/wiki/Operations_requests Operations Help Page with the links to...
[23:13:28] backporting something bd808?
[23:13:55] Krenair: yup -- https://gerrit.wikimedia.org/r/#/c/229441/
[23:14:16] I expect you'll have to do the submodule update manually
[23:14:25] 6operations: Create an offboarding workflow with IT & Operations - https://phabricator.wikimedia.org/T108131#1513000 (10RobH) We wouldn't automatically add the rights from the email onboard notification. It would just be a confirmation of 'yes this person works for us so you can expect other access requests.'...
[23:14:26] sadness
[23:14:37] Because it's the VE repo
[23:14:41] (03PS5) 10Ori.livneh: Move api listing rewrite rules to main project domains [puppet] - 10https://gerrit.wikimedia.org/r/229219 (owner: 10GWicke)
[23:16:32] gwicke: I'm going to stage this change on mw1017 first. You can force Varnish to route your requests (from any wiki) to that Apache via the X-Wikimedia-Debug header
[23:16:51] or use https://chrome.google.com/webstore/detail/wikimediadebug/binmakecefompkjggiklgjenddjoifbb?hl=en-US
[23:16:54] okay
[23:19:31] 6operations: Create an offboarding workflow with IT & Operations - https://phabricator.wikimedia.org/T108131#1513019 (10Dzahn) >>! In T108131#1512978, @Krenair wrote: > I'm not sure I like the idea of automatically granting rights like this to new staff... I did not intend to say we should change anything about...
[23:19:41] gwicke: https://test.wikipedia.org/api/
[23:19:59] LGTM -- do you want to poke at it before I merge?
[23:21:00] mischief managed -- https://gerrit.wikimedia.org/r/#/c/229599/
[23:21:25] ori: looks good to me as well
[23:21:34] thanks!
[23:21:50] (03CR) 10Ori.livneh: [C: 032] Move api listing rewrite rules to main project domains [puppet] - 10https://gerrit.wikimedia.org/r/229219 (owner: 10GWicke)
[23:22:23] now to make that listing look less horrible..
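(ori's message above describes pinning requests to the staging Apache, mw1017, by sending the X-Wikimedia-Debug header through Varnish. A sketch of what that looks like from a client, under stated assumptions: the header value "1" and the endpoint are illustrative guesses, not confirmed by the log, and actually sending the request needs network access.)

```python
import urllib.request

# Build a request carrying the X-Wikimedia-Debug header which, per the
# discussion above, tells Varnish to route the request to the debug backend
# instead of the normal app server pool. The value "1" is an assumption.
req = urllib.request.Request(
    "https://test.wikipedia.org/api/",
    headers={"X-Wikimedia-Debug": "1"},
)
# urllib normalizes header names with str.capitalize():
print(req.get_header("X-wikimedia-debug"))
# Actually sending it would be: urllib.request.urlopen(req)
```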
[23:27:35] 6operations, 6WMF-Legal, 7domains: transfer wikimedia.lt/wikipedia.lt over to MarkMonitor - https://phabricator.wikimedia.org/T87466#1513064 (10Dzahn)
[23:28:53] James_F: not asleep over here, jsut waiting for our buddy jenkins
[23:29:13] bd808: Oh, I know, I've got the screen in front of me showing zuul status.
[23:30:04] woot
[23:32:12] !log bd808 Synchronized php-1.26wmf17/extensions/VisualEditor/extension.json: VisualEditor b/c anon IP module name fix (Ia92ecc0) (duration: 00m 12s)
[23:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:32:20] James_F: ^
[23:32:37] * James_F checks.
[23:33:27] bd808: Yeah, LGTM.
[23:33:37] awesomsauce
[23:33:56] that concludes SWAT folks. remember to tip your waiter and drive safe
[23:34:11] ly
[23:34:12] Driving safe is so boring though
[23:34:46] 6operations, 6WMF-Legal, 7domains: expiring Domain wikipedia.lt? - https://phabricator.wikimedia.org/T88877#1513104 (10Dzahn)
[23:35:10] ori: no I mean literally drive a safe. Or an armored car
[23:35:31] ahhh that's much better
[23:35:54] this would work too -- http://madmax.wikia.com/wiki/Doof_Wagon
[23:36:28] HAHA
[23:37:19] !log ori Synchronized php-1.26wmf17/extensions/FlaggedRevs: I2089b21fc (duration: 00m 13s)
[23:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:37:37] By my deeds I honor him. V8
[23:38:27] (03PS1) 10Yuvipanda: k8s: Use custom systemd unit for docker to get flannel support [puppet] - 10https://gerrit.wikimedia.org/r/229601
[23:38:32] bd808: Thanks. :-)
[23:38:39] yw
[23:38:41] (03PS2) 10Yuvipanda: k8s: Use custom systemd unit for docker to get flannel support [puppet] - 10https://gerrit.wikimedia.org/r/229601
[23:40:11] 6operations, 10vm-requests: (do not) request VM for grafana - https://phabricator.wikimedia.org/T107832#1513115 (10Dzahn) 5Open>3declined
[23:40:12] 6operations, 5Patch-For-Review: move grafana from zirconium to a VM - https://phabricator.wikimedia.org/T105008#1513116 (10Dzahn)
[23:40:31] 6operations, 5Patch-For-Review: move grafana from zirconium to a VM - https://phabricator.wikimedia.org/T105008#1434202 (10Dzahn)
[23:40:32] 6operations, 10vm-requests: (do not) request VM for grafana - https://phabricator.wikimedia.org/T107832#1505304 (10Dzahn)
[23:40:46] 6operations, 5Patch-For-Review: move grafana from zirconium to a VM - https://phabricator.wikimedia.org/T105008#1434202 (10Dzahn) move it to "krypton" instead (existing VM)
[23:40:53] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Use custom systemd unit for docker to get flannel support [puppet] - 10https://gerrit.wikimedia.org/r/229601 (owner: 10Yuvipanda)
[23:41:07] 6operations: move grafana from zirconium to a VM - https://phabricator.wikimedia.org/T105008#1513126 (10Dzahn)
[23:41:07] James_F: coffee?
[23:41:27] Jamesofur: Meh. Sure.
[23:41:39] excitement I see, you clearly need it :)
[23:41:43] meet you on G or 3?
[23:42:10] * Jamesofur assumes G
[23:49:31] 6operations, 6Services: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#1513199 (10Dzahn) @cscott thank you very much for the additional comments and clarification. i'll move forward with this
[23:56:36] !log ori Synchronized wmf-config/CommonSettings.php: Unset $wgDiff (duration: 00m 12s)
[23:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master