[00:13:46] (03PS20) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [00:14:45] (03CR) 10jenkins-bot: [V: 04-1] keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [00:19:34] 06Operations, 10VisualEditor experimentation: reinstall osmium with jessie - https://phabricator.wikimedia.org/T132530#2321621 (10Peachey88) [00:20:31] (03CR) 1020after4: [C: 031] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [00:26:37] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 690 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6514609 keys - replication_delay is 690 [00:34:26] RECOVERY - WDQS SPARQL on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 10564 bytes in 0.003 second response time [00:34:37] RECOVERY - WDQS HTTP on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 10564 bytes in 0.045 second response time [00:40:46] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: puppet fail [00:41:16] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6489964 keys - replication_delay is 0 [00:48:11] 06Operations, 10Traffic, 07Browser-Support-Firefox, 07HTTPS: Secure connection failed when attempting to send POST request - https://phabricator.wikimedia.org/T134869#2280304 (10Elvey) Seeing this error too. For me too, "it occurs when saving almost any edit", (or at least most edits) since a couple days a... 
[00:54:05] (03PS21) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [00:57:51] @seen papaul [00:57:51] mutante: Last time I saw papaul they were leaving the channel #wikimedia-operations at 5/24/2016 12:25:09 AM (32m42s ago) [01:07:36] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:09:07] papaul_: hi on irssi [01:09:26] this is the bot command i used [01:09:28] @seen papaul [01:09:28] mutante: Last time I saw papaul they were leaving the channel #wikimedia-operations at 5/24/2016 12:25:09 AM (44m19s ago) [01:12:50] (03PS4) 10Foks: Adding WMF Support and Safety user groups to meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290366 (https://phabricator.wikimedia.org/T136046) [01:16:41] mutante: hey [01:16:54] (03CR) 10Jalexander: [C: 031] Adding WMF Support and Safety user groups to meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290366 (https://phabricator.wikimedia.org/T136046) (owner: 10Foks) [01:25:05] papaul_: yep [01:54:05] papaul_: so yeah [01:54:14] you should be able to id now and change your nick to just 'papaul' [01:54:21] robh: i am trying now [01:54:28] if it says its in use, try /msg nickserv release papaul [01:54:35] ok [01:54:35] or /msg nickserv ghost papaul [01:54:41] usually release if its in use but not online [01:54:44] ghost if it shows online. [01:56:28] but you cannot do either until you identify with nickserv as papaul [01:58:57] (Or specify a password with the command.) [01:59:20] I had to ghost a bot account earlier today. 
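A minimal sketch of the nick-recovery advice above, not a NickServ reference: it only puts the steps in the order described (identify first, then `release` if the nick is in use but not online, `ghost` if it shows online). The function name and the two boolean flags are hypothetical. The `release` and `ghost` command strings are the ones quoted in the channel. The identify step is left as a plain description because the log does not give its syntax, and the final `/nick` line assumes a client with the standard `/nick` command.

```python
def nick_recovery_steps(nick, in_use, shows_online):
    """Return the ordered steps for reclaiming a registered nick.

    nick         -- the registered nick to reclaim (e.g. "papaul")
    in_use       -- True if the server reports the nick as in use
    shows_online -- True if the nick currently shows as online
    """
    # Per the discussion, neither recovery command works until you have
    # identified with nickserv as the nick's owner, so that always comes first.
    steps = ["identify with nickserv as " + nick]

    if in_use:
        # "usually release if its in use but not online, ghost if it shows online"
        verb = "ghost" if shows_online else "release"
        steps.append("/msg nickserv {} {}".format(verb, nick))

    # Finally switch to the reclaimed nick (standard IRC client command).
    steps.append("/nick " + nick)
    return steps


if __name__ == "__main__":
    for step in nick_recovery_steps("papaul", in_use=True, shows_online=False):
        print(step)
```

The alternative mentioned in the channel, passing a password along with the command instead of identifying first, is deliberately not modelled here.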
[02:18:23] win 1 [02:25:20] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.2) (duration: 11m 38s) [02:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:36:29] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 79465 MB (15% inode=99%) [02:44:37] RECOVERY - Disk space on elastic1016 is OK: DISK OK [02:46:56] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 612 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6503422 keys - replication_delay is 612 [03:23:28] 06Operations, 10VisualEditor experimentation: reinstall osmium with jessie - https://phabricator.wikimedia.org/T132530#2321856 (10Dzahn) [03:23:30] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2321855 (10Dzahn) [03:24:27] 06Operations, 10VisualEditor experimentation: reinstall osmium with jessie - https://phabricator.wikimedia.org/T132530#2201664 (10Dzahn) @Peachey88 that wasn't a blocking task for T123525 since this a trusty system and not precise [03:50:28] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6491140 keys - replication_delay is 0 [05:16:18] PROBLEM - Host mr1-codfw.oob is DOWN: PING CRITICAL - Packet loss = 100% [05:27:25] .oob = out of band, which makes me notice it but not treat as emergency [05:28:46] RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 38.20 ms [05:34:58] PROBLEM - test icmp reachability to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 132 probes of 394 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [05:42:48] <_joe_> mutante: I think it's related to some maintenance? 
[05:50:38] _joe_: i see one for UnitedLayer on May 10th to upgrade their core routers but not really today (on the maint-announce@) but already recovered anyways [05:51:01] not sure who provides the oob link [05:51:43] or if the ripe-atlas can be related to it [05:52:43] i mean the probes in the icinga check above [06:06:42] (03PS5) 10Mobrovac: RESTBase: Set up rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/290264 [06:25:33] (03CR) 10Giuseppe Lavagetto: [C: 031] "My comments are mostly "for the future", the patch looks ok." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/290264 (owner: 10Mobrovac) [06:29:52] <_joe_> mobrovac: +2 whenever you want to deploy [06:30:18] PROBLEM - test icmp reachability to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 21 probes of 398 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map [06:30:48] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:58] kk thnx _joe_, still need to test it in staging [06:31:17] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:36] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:37] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:17] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:45] Hi, can someone take a look at gerrit? The connection is very slow for me, while others are working fine. Is that just me, or is this a general issue? [06:38:05] works for me just fine [06:41:07] RECOVERY - test icmp reachability to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 394 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [06:41:26] RECOVERY - test icmp reachability to eqiad on ripe-atlas-eqiad is OK: OK - failed 13 probes of 398 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map [06:41:55] For since 5 minutes.... 
Can be my connection too, but other sites are loading as fast as at other days, so I wondering why I just have at gerrit and at labs instances a performance issue [06:54:56] 06Operations, 10Traffic: Ashburn servers almost unreachable from part of Europe - https://phabricator.wikimedia.org/T136067#2321971 (10Nemo_bis) [06:55:44] _joe_, paravoid, issues with Telia https://phabricator.wikimedia.org/T136067 [06:57:07] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:28] (03CR) 10Luke081515: [C: 031] Adding WMF Support and Safety user groups to meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290366 (https://phabricator.wikimedia.org/T136046) (owner: 10Foks) [06:57:38] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:16] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:27] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:27] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:02:37] <_joe_> Nemo_bis: ouch, let me check [07:03:58] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 683 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6506136 keys - replication_delay is 683 [07:07:04] 06Operations, 10Traffic: Ashburn servers almost unreachable from part of Europe - https://phabricator.wikimedia.org/T136067#2321998 (10Nemo_bis) [07:09:50] 06Operations, 10Traffic: Ashburn servers almost unreachable from part of Europe - https://phabricator.wikimedia.org/T136067#2321999 (10Nemo_bis) 05Open>03Resolved a:03Nemo_bis Looks like a temporary capacity problem now solved by either Telia or the ISPs... ``` $ mtr -w -c 50 wikitech.wikimedia.org Star... 
[07:10:34] (03PS6) 10Mobrovac: RESTBase: Set up rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/290264 [07:13:45] (03CR) 10Mobrovac: RESTBase: Set up rate limiting (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/290264 (owner: 10Mobrovac) [07:16:03] (03CR) 10Mobrovac: "PCC's happy again - https://puppet-compiler.wmflabs.org/2889/" [puppet] - 10https://gerrit.wikimedia.org/r/290264 (owner: 10Mobrovac) [07:17:41] !log change-prop deploying 20eda89 [07:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:32:59] !log restbase deploy start of a5d00d1 [07:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:35:06] 06Operations, 10ops-codfw: codfw old mw app server decomission - https://phabricator.wikimedia.org/T135468#2322016 (10Joe) @Papaul @RobH let's not reuse hostnames, we didn't do that in eqiad either. We can start thinking about reusing hostnames when 2-300 slots are free, in a veery distant future :) [07:35:21] 06Operations, 10ops-codfw: codfw old mw app server decomission - https://phabricator.wikimedia.org/T135468#2322021 (10Joe) a:05Joe>03Papaul [07:36:07] 06Operations, 10ops-codfw: codfw old mw app server decomission - https://phabricator.wikimedia.org/T135468#2299933 (10Joe) a:05Papaul>03Joe [07:37:03] 06Operations, 10ops-codfw: codfw old mw app server decomission - https://phabricator.wikimedia.org/T135468#2299933 (10Joe) So I will start decommissioning the servers we want to dismiss. [07:42:55] (03PS1) 10Elukey: Fix memcached gmond module for Python syntax error. [puppet] - 10https://gerrit.wikimedia.org/r/290394 (https://phabricator.wikimedia.org/T129963) [07:44:14] (03CR) 10Elukey: [C: 032] Fix memcached gmond module for Python syntax error. [puppet] - 10https://gerrit.wikimedia.org/r/290394 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [07:47:24] this --^ is really bad and mostly my fault. 
We wanted to add a metric and ended up knoking down the memcached gmond module. [07:48:00] so if you see a hole in ganglia for memcached metrics, it was me [07:56:35] 06Operations, 07HHVM, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2322035 (10Joe) @Anomie is this a blocker? I thought unicode standards didn't break anything between minor versions, and from what I can see both here: https://en.wikibooks.org/wi... [07:56:51] !log restbase deploy end of a5d00d1 [07:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:00:06] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2322036 (10ori) This data is now in Graphite, too. For example: https://graphite.wikimedia.org/S/BW . [08:00:37] ok this is weird, only some of the memcached metrics have holes and some hosts were already not pushing any data since wed last week [08:01:13] (03PS2) 10Muehlenhoff: Enable base::firewall on potassium [puppet] - 10https://gerrit.wikimedia.org/r/286148 [08:08:46] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable base::firewall on potassium [puppet] - 10https://gerrit.wikimedia.org/r/286148 (owner: 10Muehlenhoff) [08:10:16] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2322038 (10elukey) >>! In T129963#2322036, @ori wrote: > This data is now in Graphite, too. For example: https://graphite.wikimedia.org/S/BW . Ah I didn't get... 
[08:12:19] !log enabled base::firewall on potassium (pool counter) [08:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:16:21] (03PS1) 10Muehlenhoff: Restore alphabetic order in site.pp after holmium rename [puppet] - 10https://gerrit.wikimedia.org/r/290402 [08:19:13] (03PS7) 10Mobrovac: RESTBase: Set up rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/290264 [08:19:26] (03CR) 10Muehlenhoff: [C: 032 V: 032] Restore alphabetic order in site.pp after holmium rename [puppet] - 10https://gerrit.wikimedia.org/r/290402 (owner: 10Muehlenhoff) [08:24:41] (03CR) 10Luke081515: "Remeber that userrights-interwiki makes only real sense together with userrights, because userrights-interwiki only gives you the permissi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290366 (https://phabricator.wikimedia.org/T136046) (owner: 10Foks) [08:25:07] (03PS1) 10Muehlenhoff: Enable base::firewall in the pool counter role [puppet] - 10https://gerrit.wikimedia.org/r/290405 [08:26:52] (03PS8) 10Mobrovac: RESTBase: Set up rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/290264 [08:34:33] (03PS1) 10Giuseppe Lavagetto: mediawiki: decommission old codfw appservers [puppet] - 10https://gerrit.wikimedia.org/r/290407 (https://phabricator.wikimedia.org/T135468) [08:35:10] (03CR) 10Nikerabbit: [C: 031] Enable Compact Language Links as default in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290165 (https://phabricator.wikimedia.org/T134966) (owner: 10KartikMistry) [08:38:14] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2322096 (10elukey) Very basic one: https://grafana.wikimedia.org/dashboard/db/memcached [08:41:28] (03CR) 10Muehlenhoff: "No idea about the Cassandra config, but the ferm part is fine" [puppet] - 10https://gerrit.wikimedia.org/r/290264 (owner: 10Mobrovac) [08:45:56] !log reboot restbase2006 for 
multi-instance conversion T113714 [08:45:57] T113714: Separate /var on restbase - https://phabricator.wikimedia.org/T113714 [08:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:47:06] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 3 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2214186 (10Storkk) Also happened four or five times on: API request failed (intern... [08:48:20] (03CR) 10Alexandros Kosiaris: [C: 031] RESTBase: Set up rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/290264 (owner: 10Mobrovac) [08:53:00] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2322113 (10elukey) Remaining issues: 1) I can't find the new get_hits_ratio metric on ganglia, not sure if we need to add more settings to enable it. 2) mc101... [08:57:58] (03PS1) 10Aklapper: Add "Lua" to syntax highlighting dropdown choices in Phab's "Paste" [puppet] - 10https://gerrit.wikimedia.org/r/290409 (https://phabricator.wikimedia.org/T100900) [08:58:38] (03PS1) 10Elukey: Remove Spark dynamic maxExecutors setting since it is not needed. [puppet/cdh] - 10https://gerrit.wikimedia.org/r/290410 (https://phabricator.wikimedia.org/T101343) [08:59:59] (03CR) 10Filippo Giunchedi: [C: 031] RESTBase: Set up rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/290264 (owner: 10Mobrovac) [09:00:30] (03CR) 10Elukey: [C: 032] Remove Spark dynamic maxExecutors setting since it is not needed. [puppet/cdh] - 10https://gerrit.wikimedia.org/r/290410 (https://phabricator.wikimedia.org/T101343) (owner: 10Elukey) [09:02:02] (03PS1) 10Elukey: Update the CDH module with the last change for Spark default settings. 
[puppet] - 10https://gerrit.wikimedia.org/r/290412 (https://phabricator.wikimedia.org/T101343) [09:02:05] (03PS2) 10Filippo Giunchedi: cassandra: add restbase2006 instances [puppet] - 10https://gerrit.wikimedia.org/r/290244 (https://phabricator.wikimedia.org/T95253) [09:02:12] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase2006 instances [puppet] - 10https://gerrit.wikimedia.org/r/290244 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [09:02:18] (03CR) 10Elukey: [C: 032] Update the CDH module with the last change for Spark default settings. [puppet] - 10https://gerrit.wikimedia.org/r/290412 (https://phabricator.wikimedia.org/T101343) (owner: 10Elukey) [09:02:30] (03CR) 10Elukey: [V: 032] Update the CDH module with the last change for Spark default settings. [puppet] - 10https://gerrit.wikimedia.org/r/290412 (https://phabricator.wikimedia.org/T101343) (owner: 10Elukey) [09:02:36] (03PS2) 10Elukey: Update the CDH module with the last change for Spark default settings. [puppet] - 10https://gerrit.wikimedia.org/r/290412 (https://phabricator.wikimedia.org/T101343) [09:02:58] (03CR) 10Elukey: [V: 032] Update the CDH module with the last change for Spark default settings. [puppet] - 10https://gerrit.wikimedia.org/r/290412 (https://phabricator.wikimedia.org/T101343) (owner: 10Elukey) [09:08:23] (03CR) 10Mobrovac: [C: 031] "Cherry-picked on beta, both the patch and rate limiting work as advertised." 
[puppet] - 10https://gerrit.wikimedia.org/r/290264 (owner: 10Mobrovac) [09:12:43] (03CR) 10Elukey: [C: 031] mediawiki: assign new eqiad appservers, install with jessie [puppet] - 10https://gerrit.wikimedia.org/r/290236 (owner: 10Giuseppe Lavagetto) [09:15:10] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Couple of minor issues inline, otherwise LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [09:18:19] !log reboot restbase2003 for multi-instance conversion T113714 [09:18:20] T113714: Separate /var on restbase - https://phabricator.wikimedia.org/T113714 [09:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:28:32] 06Operations, 03Discovery-Search-Sprint, 13Patch-For-Review: Check Icinga alert on CirrusSearch response time - https://phabricator.wikimedia.org/T134852#2322150 (10Gehel) Change 290262 has been merged, but Graphite1001 (which should run that check) has puppet disabled, pending investigation on graphite loos... [09:31:58] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Wikidata Query Service REST endpoint returns truncated results - https://phabricator.wikimedia.org/T133490#2322153 (10ema) [09:32:02] 06Operations, 10Traffic: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2322154 (10ema) [09:32:06] 06Operations, 10Traffic, 13Patch-For-Review: cache_misc's misc_fetch_large_objects has issues - https://phabricator.wikimedia.org/T128813#2322155 (10ema) [09:32:09] 06Operations, 10Traffic, 13Patch-For-Review: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501#2322151 (10ema) 05Open>03Resolved It's now been a week after our re-upgrade of cache_misc to Varnish 4 and I'm not aware of any new issues being discovered. Closing. 
[09:33:38] (03PS3) 10Filippo Giunchedi: cassandra: add restbase2003 instances [puppet] - 10https://gerrit.wikimedia.org/r/290237 [09:33:59] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase2003 instances [puppet] - 10https://gerrit.wikimedia.org/r/290237 (owner: 10Filippo Giunchedi) [09:47:20] gehel: the graphite checks run on the icinga machine btw, in this case neon [09:48:05] godog: yep, but it is exported by graphite1001... not really sure why... [09:48:55] godog: I corrected the comment [09:50:10] gehel: ah I see, not sure why either, I'm going to reenable puppet there anyways at this point [09:50:38] godog: thanks! We'll see if that check is of any use... [09:50:59] !log applying schema change to x1 hosts T135699 [09:51:00] T135699: Schema changes for Echo moderation - https://phabricator.wikimedia.org/T135699 [09:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:51:51] !log reenable puppet on graphite1001 T135385 [09:51:52] T135385: investigate carbon-c-relay stalls/drops towards graphite2002 - https://phabricator.wikimedia.org/T135385 [09:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:12:31] 06Operations, 10RESTBase: enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#2322215 (10fgiunchedi) [10:12:34] 06Operations, 06Services, 10cassandra, 13Patch-For-Review, 07RESTBase-architecture: Separate /var on restbase - https://phabricator.wikimedia.org/T113714#2322213 (10fgiunchedi) 05Open>03Resolved this is completed, all restbase now have standalone `/srv` [10:16:30] 06Operations, 03Discovery-Search-Sprint, 13Patch-For-Review: Check Icinga alert on CirrusSearch response time - https://phabricator.wikimedia.org/T134852#2279774 (10fgiunchedi) I've reenabled puppet on graphite1001, the check should get exported soon! 
[10:20:54] 07Blocked-on-Operations, 06Operations, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2322264 (10fgiunchedi) all restbase machines are multi-instance now, pending addition of additional instances... [10:32:00] !log creating backup of beta database just in case T119567 [10:32:01] T119567: Run Flow External Store migration in dry-run mode on Beta - https://phabricator.wikimedia.org/T119567 [10:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:33:19] hashar ^that should not impact anything on CI/beta, but just in case [10:34:06] (03PS1) 10Muehlenhoff: Add ferm rules for role::snapshot::dumper [puppet] - 10https://gerrit.wikimedia.org/r/290421 [10:34:08] (03PS1) 10Muehlenhoff: Enable base::firewall for snapshot1005 [puppet] - 10https://gerrit.wikimedia.org/r/290422 [10:36:50] 06Operations, 10Beta-Cluster-Infrastructure, 06Labs, 10Traffic, 13Patch-For-Review: Varnishlog doesn't properly rotates logs, varnish.log is empty since forever (was: deployment-cache-upload04 (m1.medium) / is almost full) - https://phabricator.wikimedia.org/T135700#2322313 (10hashar) Puppet is broken on... 
[10:39:58] jynus: that might be just fine :) [10:40:21] jynus: eventually we would want to migrate / upgrade the beta cluster db host [10:40:31] I am not sure which OS or which mysql flavor they are currently using [10:49:50] (03CR) 10Hashar: "Filled as T136078" [puppet] - 10https://gerrit.wikimedia.org/r/276950 (owner: 10Hashar) [10:50:36] (03CR) 10jenkins-bot: [V: 04-1] Enable base::firewall for snapshot1005 [puppet] - 10https://gerrit.wikimedia.org/r/290422 (owner: 10Muehlenhoff) [10:50:49] <_joe_> !log disabling puppet on mw2001-60 (minus 2017) for decommissioning [10:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:55:35] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure: Hiera hierarchy hieradata/role/* is not applied on labs (eg deployment-prep) - https://phabricator.wikimedia.org/T136080#2322414 (10hashar) [10:55:51] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure: Hiera hierarchy hieradata/role/* is not applied on labs (eg deployment-prep) - https://phabricator.wikimedia.org/T136080#2322431 (10hashar) [10:55:57] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure: Hiera hierarchy hieradata/role/* is not applied on labs (eg deployment-prep) - https://phabricator.wikimedia.org/T136080#2322414 (10hashar) [11:03:10] (03PS9) 10Giuseppe Lavagetto: RESTBase: Set up rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/290264 (owner: 10Mobrovac) [11:04:09] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] RESTBase: Set up rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/290264 (owner: 10Mobrovac) [11:04:52] <_joe_> mobrovac: should I run puppet && restart restbase across the restbase_test cluster? [11:05:15] _joe_: yup, go ahead [11:06:34] <_joe_> ongoing [11:08:41] <_joe_> mobrovac: done [11:08:50] kk thnx _joe_ [11:08:52] will start test [11:09:40] <_joe_> I was looking for traffic over port 3050 on cerium but I don't see any [11:09:44] <_joe_> is that expected? 
[11:10:22] there should be some, but infrequent [11:11:13] <_joe_> should restbase keep an open socket on that port? [11:11:24] no socket up indeed [11:11:25] :/ [11:11:35] will try to find out what's going on [11:11:45] <_joe_> mobrovac: wait [11:11:58] <_joe_> I did a typo [11:12:17] i don't see start up msgs in logstash for restbase_test [11:12:18] <_joe_> so services not restarted [11:12:34] _joe_: k, let me restart them sequentially [11:12:39] <_joe_> eh [11:12:41] <_joe_> too late [11:12:59] haha [11:13:00] kk [11:13:03] <_joe_> I did already restart them :P [11:14:56] euh? [11:15:04] <_joe_> what's up? [11:15:06] why don't i see any startup msgs in logstash then? [11:15:25] kk, the socket is up on 3050 [11:15:31] <_joe_> look, cerium says Active: active (running) since Tue 2016-05-24 11:12:31 UTC; 2min 46s ago [11:15:34] <_joe_> :P [11:16:34] ah because they've ended up in the restbase dashboard, not restbase-test [11:16:37] *sigh* [11:16:42] we really have to fix this [11:16:45] <_joe_> lol [11:17:29] _joe_: kk, gimme around 20 mins to make sure all's good and we'll proceed to prod [11:17:30] k? [11:18:58] <_joe_> cool [11:20:46] PROBLEM - puppet last run on mw1142 is CRITICAL: CRITICAL: Puppet has 6 failures [11:21:18] !log rolling restart of elasticsearch in logstash cluster to pick up openjdk security update [11:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:22:22] !log enabling GTID on s1 codfw db servers [11:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:38:46] PROBLEM - puppet last run on mw1239 is CRITICAL: CRITICAL: Puppet has 1 failures [11:39:35] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:41:02] that's me ^ [11:41:26] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6498967 keys - replication_delay is 0 [11:45:26] PROBLEM - HHVM rendering on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:45:26] PROBLEM - Apache HTTP on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:47:15] PROBLEM - puppet last run on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:47:35] PROBLEM - salt-minion processes on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:47:56] _joe_: kk, all good [11:48:05] PROBLEM - RAID on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:48:07] _joe_: mind enabling puppet and running it in prod? [11:48:09] i can then restart [11:48:36] PROBLEM - Check size of conntrack table on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:48:47] PROBLEM - configured eth on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:05] PROBLEM - Disk space on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:06] PROBLEM - dhclient process on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:16] PROBLEM - DPKG on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:16] RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 19 minutes ago with 0 failures [11:49:17] PROBLEM - HHVM processes on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:25] <_joe_> mobrovac: ok [11:49:25] PROBLEM - nutcracker process on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:51] 06Operations, 06Services, 10cassandra, 13Patch-For-Review, 07RESTBase-architecture: Separate /var on restbase - https://phabricator.wikimedia.org/T113714#2322520 (10mobrovac) >>! 
In T113714#2322213, @fgiunchedi wrote: > this is completed, all restbase now have standalone `/srv` \o/ Thank you @fgiunchedi ! [11:50:06] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [11:50:43] <_joe_> mobrovac: puppet is running, it will take some time [11:50:51] kk [11:52:27] PROBLEM - nutcracker port on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:52:33] (03CR) 10Hashar: [C: 031] mv files/misc/scripts/Makefile to scap module [puppet] - 10https://gerrit.wikimedia.org/r/289351 (owner: 10Dzahn) [11:52:36] PROBLEM - SSH on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:52:55] RECOVERY - configured eth on mw1142 is OK: OK - interfaces up [11:53:34] <_joe_> mobrovac: done [11:53:53] <_joe_> I suggest restarting codfw first [11:53:56] !log restbase resting nodes to pick up https://gerrit.wikimedia.org/r/#/c/290264/ [11:54:01] yup [11:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:59:16] PROBLEM - configured eth on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:03:32] RECOVERY - puppet last run on mw1239 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [12:10:18] <_joe_> !log cleaning puppet facts, salt keys for mw2001-2060 [12:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:13:23] RECOVERY - Disk space on mw1142 is OK: DISK OK [12:13:23] RECOVERY - nutcracker process on mw1142 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [12:13:31] RECOVERY - nutcracker port on mw1142 is OK: TCP OK - 0.000 second response time on port 11212 [12:13:32] RECOVERY - HHVM processes on mw1142 is OK: PROCS OK: 6 processes with command name hhvm [12:13:32] RECOVERY - salt-minion processes on mw1142 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:13:43] RECOVERY - SSH on mw1142 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [12:14:32] RECOVERY - RAID on mw1142 is OK: OK: no RAID installed [12:14:41] RECOVERY - Check size of conntrack table on mw1142 is OK: OK: nf_conntrack is 0 % full [12:14:51] RECOVERY - configured eth on mw1142 is OK: OK - interfaces up [12:15:12] RECOVERY - DPKG on mw1142 is OK: All packages OK [12:15:13] RECOVERY - dhclient process on mw1142 is OK: PROCS OK: 0 processes with command name dhclient [12:19:51] PROBLEM - Disk space on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:20:12] PROBLEM - HHVM processes on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:20:12] PROBLEM - salt-minion processes on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:20:21] PROBLEM - SSH on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:21:03] PROBLEM - RAID on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:21:11] PROBLEM - Check size of conntrack table on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:21:22] PROBLEM - configured eth on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:21:42] PROBLEM - DPKG on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:21:42] PROBLEM - dhclient process on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:22:02] PROBLEM - nutcracker process on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:22:12] PROBLEM - nutcracker port on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:30:51] PROBLEM - puppet last run on mw1145 is CRITICAL: CRITICAL: Puppet has 4 failures [12:32:59] !log deploying GTID to all codfw db production hosts [12:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:45:11] RECOVERY - dhclient process on mw1142 is OK: PROCS OK: 0 processes with command name dhclient [12:45:11] RECOVERY - Check size of conntrack table on mw1142 is OK: OK: nf_conntrack is 0 % full [12:45:30] RECOVERY - nutcracker port on mw1142 is OK: TCP OK - 0.000 second response time on port 11212 [12:45:32] RECOVERY - DPKG on mw1142 is OK: All packages OK [12:45:42] RECOVERY - nutcracker process on mw1142 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [12:45:50] RECOVERY - Disk space on mw1142 is OK: DISK OK [12:46:01] RECOVERY - HHVM processes on mw1142 is OK: PROCS OK: 6 processes with command name hhvm [12:46:11] RECOVERY - salt-minion processes on mw1142 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:46:40] RECOVERY - SSH on mw1142 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [12:47:01] RECOVERY - configured eth on mw1142 is OK: OK - interfaces up [12:47:01] RECOVERY - RAID on mw1142 is OK: OK: no RAID installed [12:49:08]  [12:49:47] !log stopping kafka on kafka1013 and rebooting the host for kernel upgrade [12:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master 
[12:50:13] 06Operations, 10DBA, 10MediaWiki-Database, 07Performance: Implement GTID replication on MariaDB 10 servers - https://phabricator.wikimedia.org/T133385#2322671 (10jcrespo) Approximately half of the servers are using GTID, all the codfw slaves, all the external storage ones and some of the recently reimaged/... [12:53:47] <_joe_> looks like a few api appservers are ooming, can someone take a look? I am mostly afk now [12:55:30] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: processor/client-side-11 processor/client-side-07 processor/client-side-04 processor/client-side-03 processor/client-side-00 [12:55:33] RECOVERY - puppet last run on mw1145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:55:46] event logging is due to me [12:57:31] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are runnning. [13:04:59] PROBLEM - puppet last run on lvs2006 is CRITICAL: CRITICAL: puppet fail [13:07:58] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 639 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6508250 keys - replication_delay is 639 [13:08:56] I think mysql on beta has serious issues [13:09:10] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 62.07% of data above the critical threshold [10.0] [13:09:43] it could be just my user, but having into account I am root: "mysqldump: Couldn't execute 'show create table `bounce_records`': View 'labswiki.bounce_records' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them (1356)" [13:10:00] the kafka alert is from me, I silenced kafka1013 but 1022 is complaining [13:10:07] will clear in a bit with partition rebalancing [13:10:19] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 
62.07% of data above the critical threshold [10.0] [13:10:39] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [10.0] [13:10:50] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [10.0] [13:11:25] RECOVERY - mysqld processes on labservices1002 is OK: PROCS OK: 1 process with command name mysqld [13:11:58] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [10.0] [13:12:30] heh, recovery page [13:13:36] ah, it seems to be just a view [13:13:50] I do not know if it is even worth a ticket [13:15:05] 06Operations, 07HHVM, 13Patch-For-Review, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2322732 (10Anomie) To tell the truth, I have no idea if it should be a blocker or not. But I went ahead and figured out how to update utfnormal, so the patche... [13:20:22] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 2 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2322752 (10StevenJ81) I've tried playing with this language in the drop-downs of multilingual projects like Meta and Wikidata. On Wikidata, wher... [13:23:41] !log reverted net.netfilter.nf_conntrack_tcp_timeout_time_wait on kafka1013 back to 65 (as it should have been set by sysctl.d) [13:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:26:16] 06Operations, 07HHVM, 13Patch-For-Review, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2322775 (10Anomie) I'd guess it's probably not a blocker since we probably use the intl extension rather than those data tables on the cluster, and the same f... 
[13:26:20] PROBLEM - Apache HTTP on mw1145 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50408 bytes in 4.936 second response time [13:26:29] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 50.00% above the threshold [1.0] [13:26:59] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1014 is OK: OK: Less than 50.00% above the threshold [1.0] [13:27:28] PROBLEM - HHVM rendering on mw1145 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.012 second response time [13:27:29] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK: OK: Less than 50.00% above the threshold [1.0] [13:28:09] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1018 is OK: OK: Less than 50.00% above the threshold [1.0] [13:28:30] RECOVERY - puppet last run on lvs2006 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [13:29:16] (03PS1) 10Filippo Giunchedi: uwsgi: include app name in syslog [puppet] - 10https://gerrit.wikimedia.org/r/290454 [13:29:18] (03PS1) 10Filippo Giunchedi: graphite: split uwsgi logs to separate files [puppet] - 10https://gerrit.wikimedia.org/r/290455 [13:29:36] !log restarted hhvm on mw1145 (ran hhvm-dump-debug, output available in hhvm.14155.bt) [13:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:30:07] Out of memory: Kill process 14155 (hhvm) score 923 or sacrifice child [13:30:19] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1012 is OK: OK: Less than 50.00% above the threshold [1.0] [13:30:48] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.094 second response time [13:32:00] RECOVERY - HHVM rendering on mw1145 is OK: HTTP OK: HTTP/1.1 200 OK - 66954 bytes in 0.546 second response time [13:33:48] PROBLEM - puppet last run on mw1145 is CRITICAL: CRITICAL: Puppet has 79 failures [13:36:45] Are labs having issues? 
[13:37:48] Josve05afk: what are you seeing? [13:37:52] (03PS3) 10Filippo Giunchedi: prometheus: add server support [puppet] - 10https://gerrit.wikimedia.org/r/280652 (https://phabricator.wikimedia.org/T126785) [13:38:28] Been getting 502's off-and-on all day...(nginx/1.9.4) [13:41:28] RECOVERY - Check for gridmaster host resolution UDP on labs-ns1.wikimedia.org is OK: DNS OK - 0.020 seconds response time (tools-grid-master.tools.eqiad.wmflabs. 60 IN A 10.68.20.158) [13:43:19] (03PS4) 10Filippo Giunchedi: prometheus: add server support [puppet] - 10https://gerrit.wikimedia.org/r/280652 (https://phabricator.wikimedia.org/T126785) [13:44:08] (asked in -labs) [13:45:33] (03CR) 10jenkins-bot: [V: 04-1] graphite: split uwsgi logs to separate files [puppet] - 10https://gerrit.wikimedia.org/r/290455 (owner: 10Filippo Giunchedi) [13:48:44] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Decommission es2001-es2010 - https://phabricator.wikimedia.org/T134755#2275897 (10mark) @jcrespo wants to keep es2001-2004 in their current state (racked, power on, with their current data) for another year as a safety measure. That's fine with me. Let'... [13:50:28] 06Operations, 03Discovery-Search-Sprint, 13Patch-For-Review: Check Icinga alert on CirrusSearch response time - https://phabricator.wikimedia.org/T134852#2322834 (10Gehel) I confirm, checks are now visible (and green) on [[ https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=graphite1001 | graphite10... [13:55:38] RECOVERY - puppet last run on mw1145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:01:16] !log dropping labswiki.bounce_records on db1-BETA [14:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:02:26] 06Operations, 07HHVM, 13Patch-For-Review, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2322878 (10PleaseStand) >>! 
In T86096#2322775, @Anomie wrote: > I'd guess it's probably not a blocker since we probably use the intl extension rather than tho... [14:11:58] (03PS1) 10Aude: Make entityNamespaces setting available to Wikibase Client wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290466 (https://phabricator.wikimedia.org/T136075) [14:15:51] 06Operations, 06Services, 10cassandra, 13Patch-For-Review, 07RESTBase-architecture: Separate /var on restbase - https://phabricator.wikimedia.org/T113714#2322890 (10Eevans) >>! In T113714#2322520, @mobrovac wrote: >>>! In T113714#2322213, @fgiunchedi wrote: >> this is completed, all restbase now have sta... [14:24:48] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Decommission es2001-es2010 - https://phabricator.wikimedia.org/T134755#2322912 (10jcrespo) @robh- my suggestion of follow up: * I will power on es2001-es2004, keep name and network for that year so there is not overhead on DC ops * Folowing Mark's reco... [14:26:37] (03PS2) 10Aude: Make entityNamespaces setting available to Wikibase Client wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290466 (https://phabricator.wikimedia.org/T136075) [14:26:43] 06Operations, 07HHVM, 13Patch-For-Review, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2322920 (10PleaseStand) >>! In T86096#2322878, @PleaseStand wrote: > This might result in inconsistent normalization when some inputs contain invalid UTF-8 se... [14:33:10] (03PS2) 10Jcrespo: Remove dns entries for es2005-es2010 [dns] - 10https://gerrit.wikimedia.org/r/287645 (https://phabricator.wikimedia.org/T134755) [14:35:36] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Decommission es2001-es2010 - https://phabricator.wikimedia.org/T134755#2322939 (10jcrespo) I've updated https://gerrit.wikimedia.org/r/#/c/287645 to not remove es200[1234], ready to apply when needed. [14:36:53] (03CR) 10Daniel Kinzler: [C: 031] "I agree that we need this." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/290466 (https://phabricator.wikimedia.org/T136075) (owner: 10Aude) [14:37:51] (03CR) 10Aude: Enable Compact Language Links as default in Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290165 (https://phabricator.wikimedia.org/T134966) (owner: 10KartikMistry) [14:37:53] 06Operations, 07HHVM, 13Patch-For-Review, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2322942 (10Joe) @PleaseStand thanks for the detailed analisys. So, although this inconsistency doesn't frankly seem like a blocker for upgrading libicu at th... [14:38:50] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Decommission es2005-es2010 - https://phabricator.wikimedia.org/T134755#2322948 (10jcrespo) [14:46:33] (03PS2) 10Gehel: WIP experiments, just keeping that safe somewhere... [puppet] - 10https://gerrit.wikimedia.org/r/288691 [14:48:55] (03CR) 10jenkins-bot: [V: 04-1] WIP experiments, just keeping that safe somewhere... [puppet] - 10https://gerrit.wikimedia.org/r/288691 (owner: 10Gehel) [14:51:15] <_joe_> gehel: you can use git review -D and make it a draft [14:51:20] <_joe_> that would be only visible to you [14:52:08] _joe_: Thanks! I'll try that... too many things to learn about gerrit... 
[14:52:12] 06Operations, 10Ops-Access-Requests, 06Services: Expand sc-admins to provide sufficient coverage for sc* clusters - https://phabricator.wikimedia.org/T135548#2322977 (10Joe) @robh yes it was approved in the meeting and I was actually sponsoring this :) [14:52:16] (03PS1) 10Andrew Bogott: Document needed database steps when setting up a new labservices box [puppet] - 10https://gerrit.wikimedia.org/r/290471 (https://phabricator.wikimedia.org/T136065) [14:53:06] <_joe_> gehel: yw, I tend to use it for the first version embarassing code changes :P [14:53:31] (03Abandoned) 10Andrew Bogott: Keystone: Adopt a multi-domain model [puppet] - 10https://gerrit.wikimedia.org/r/244350 (owner: 10Andrew Bogott) [14:53:46] _joe_: yeah, that one is pretty embarrassing at this point... but I'm used to being embarrassed :P [14:54:08] 06Operations, 10Ops-Access-Requests, 06Services: Expand sc-admins to provide sufficient coverage for sc* clusters - https://phabricator.wikimedia.org/T135548#2322986 (10RobH) @gwicke: This was approved in the operations team meeting, but to ensure I ONLY add the right folks, can we confirm the users to be ad... [14:54:30] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: rename holmium to labservices1002 - https://phabricator.wikimedia.org/T106303#2322987 (10Andrew) [14:55:33] aude: Thanks for headsup. Let me see how I can fix it. [14:55:57] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Set up LVS for labs dns recursors - https://phabricator.wikimedia.org/T119660#2322993 (10Andrew) [14:56:00] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: rename holmium to labservices1002 - https://phabricator.wikimedia.org/T106303#1464173 (10Andrew) [14:56:05] kart_: ok [14:56:28] i would just keep the old setting and then make a follow up patch that removes it from initialise settings [14:57:20] if swat is just my patch and yours, then i would be okay with doing swat [14:57:39] Look like that. 
[14:57:43] (03PS1) 10Gehel: Specific configuration for new maps cluster. [puppet] - 10https://gerrit.wikimedia.org/r/290472 (https://phabricator.wikimedia.org/T134901) [14:58:12] (03CR) 10Andrew Bogott: [C: 032] Document needed database steps when setting up a new labservices box [puppet] - 10https://gerrit.wikimedia.org/r/290471 (https://phabricator.wikimedia.org/T136065) (owner: 10Andrew Bogott) [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160524T1500). [15:00:04] kart_ aude: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:39] (03CR) 10Gehel: "Puppet compiler looks correct: https://puppet-compiler.wmflabs.org/2892/" [puppet] - 10https://gerrit.wikimedia.org/r/290472 (https://phabricator.wikimedia.org/T134901) (owner: 10Gehel) [15:01:16] I can SWAT: kart_ aude ping me when you're around [15:01:23] thcipriani: ok [15:01:33] thcipriani: around. [15:01:38] <_joe_> !log revoking certs, puppet facts, salt keys for mw2001-60 [15:01:43] ohai :) [15:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:01:51] thcipriani: we had discussion about my config patch. [15:02:17] thcipriani: please check, https://gerrit.wikimedia.org/r/#/c/290165/ [15:03:13] thcipriani: should I fix according to it or patch is OK? [15:04:31] it seems like if you sync initialisesettings.php that will just make new variables there, then you can sync commonsettings.php and it will find the new variables. Am I missing something important? [15:04:51] thcipriani: the old variable will then be briefly unavailable to commonsettings [15:05:05] it would just briefly spam the logs with an undefined variable notice [15:05:07] ah, yeah, I see that now. 
[15:05:13] but otherwise maybe that's okay [15:05:19] if it's short enough [15:05:41] if my patch, i would remove the variable as a second follow up patch [15:06:01] kart_: if you could do that, it would be great ^ [15:06:06] OK [15:06:08] 06Operations: Race condition in setting net.netfilter.nf_conntrack_tcp_timeout_time_wait - https://phabricator.wikimedia.org/T136094#2323036 (10MoritzMuehlenhoff) [15:06:10] Submitting. [15:06:33] Someday™ this will not be a problem. [15:06:40] (03PS2) 10Gehel: Specific configuration for new maps cluster. [puppet] - 10https://gerrit.wikimedia.org/r/290472 (https://phabricator.wikimedia.org/T134901) [15:06:45] or you could just remove the old variable first from commonsettings, it is not read ;) [15:07:48] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290466 (https://phabricator.wikimedia.org/T136075) (owner: 10Aude) [15:08:26] (03Merged) 10jenkins-bot: Make entityNamespaces setting available to Wikibase Client wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290466 (https://phabricator.wikimedia.org/T136075) (owner: 10Aude) [15:08:51] (03CR) 10Gehel: "new puppet-compiler output still looks good: https://puppet-compiler.wmflabs.org/2893/" [puppet] - 10https://gerrit.wikimedia.org/r/290472 (https://phabricator.wikimedia.org/T134901) (owner: 10Gehel) [15:08:53] Nikerabbit: that too. I should have done that, but. 
[15:09:19] (03PS6) 10KartikMistry: Enable Compact Language Links as default in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290165 (https://phabricator.wikimedia.org/T134966) [15:10:06] anyway that works is good [15:10:42] 06Operations, 10Ops-Access-Requests, 06Services: Expand sc-admins to provide sufficient coverage for sc* clusters - https://phabricator.wikimedia.org/T135548#2323071 (10RobH) a:03RobH [15:11:50] !log thcipriani@tin Synchronized wmf-config/Wikibase.php: SWAT: [[gerrit:290466|Make entityNamespaces setting available to Wikibase Client wikis]] (duration: 00m 29s) [15:11:54] checking [15:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:09] uhh seeing lots of host-key verification failures [15:12:18] did we reimage like 50 hosts? [15:12:24] :( [15:13:08] any opsen around for that? ^ _joe_ ? [15:13:42] (03PS1) 10KartikMistry: Enable Compact Language Links as default in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290475 (https://phabricator.wikimedia.org/T134966) [15:13:49] the patch seems ok, though [15:14:13] that's good :) [15:14:22] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [15:15:09] (03PS3) 10Gehel: Specific configuration for new maps cluster. [puppet] - 10https://gerrit.wikimedia.org/r/290472 (https://phabricator.wikimedia.org/T134901) [15:15:15] (03PS2) 10Giuseppe Lavagetto: mediawiki: decommission old codfw appservers [puppet] - 10https://gerrit.wikimedia.org/r/290407 (https://phabricator.wikimedia.org/T135468) [15:16:17] thcipriani: split patch is done. [15:16:29] kart_: thank you! [15:16:40] (03CR) 10Gehel: [C: 032] Specific configuration for new maps cluster. 
[puppet] - 10https://gerrit.wikimedia.org/r/290472 (https://phabricator.wikimedia.org/T134901) (owner: 10Gehel) [15:17:47] I guess we're in the midst of decommissioning codfw servers which might explain the host key problems? [15:18:27] (03PS3) 10Giuseppe Lavagetto: mediawiki: decommission old codfw appservers [puppet] - 10https://gerrit.wikimedia.org/r/290407 (https://phabricator.wikimedia.org/T135468) [15:18:40] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [15:19:22] thcipriani, let me check the log [15:20:02] where is the scap log? [15:20:35] <_joe_> thcipriani: yeah sorry [15:20:53] <_joe_> thcipriani: we didn't, I am removing them and I have not been fast enough with my change [15:21:53] _joe_: ah, as long as it's fine that I can't connect since they're no longer around. [15:21:55] <_joe_> thcipriani: fixing in 2 minutes [15:22:00] (03PS4) 10Giuseppe Lavagetto: mediawiki: decommission old codfw appservers [puppet] - 10https://gerrit.wikimedia.org/r/290407 (https://phabricator.wikimedia.org/T135468) [15:22:08] kk, I'll wait until then. [15:22:12] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6495380 keys - replication_delay is 0 [15:22:15] <_joe_> let's merge this and run it on tin [15:22:22] jynus: scap sends its logs to logstash, FYI [15:22:35] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] mediawiki: decommission old codfw appservers [puppet] - 10https://gerrit.wikimedia.org/r/290407 (https://phabricator.wikimedia.org/T135468) (owner: 10Giuseppe Lavagetto) [15:22:39] ah, ok, good to know for the next time [15:23:10] https://logstash.wikimedia.org/#/dashboard/elasticsearch/scap [15:24:23] I can see now the errors on fluorine [15:24:36] 47 apaches had sync errors [15:24:54] <_joe_> jynus: just 47? 
[15:25:00] yeah + 2 proxy nodes [15:25:17] that is just the last WARNING [15:25:19] <_joe_> thcipriani: 1 min and tin should be done running puppet [15:26:00] thank you [15:26:02] didn't go back much in time, as it was "expected", just wanted to localize the log [15:26:05] <_joe_> !log shutting down mw2001-2060 [15:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:26:22] <_joe_> thcipriani: green light from me [15:26:32] _joe_: cool, thanks :) [15:26:32] <_joe_> sorry I was supposed to merge this before 5 pm [15:26:38] <_joe_> but then got derailed [15:27:09] <_joe_> (5 pm my time is morning swat) [15:27:28] yarp. no harm done. woke me up better than my morning coffee. [15:33:19] kart_: so I think there is now a different problem with the way patches are split: it seems like $wmgULSCompactLanguageLinksBetaFeature will be undefined in production after the first patch set merges. [15:34:04] until https://gerrit.wikimedia.org/r/#/c/290475/1/wmf-config/InitialiseSettings.php syncs [15:36:44] hmm.. [15:37:31] (03PS5) 10Filippo Giunchedi: prometheus: add server support [puppet] - 10https://gerrit.wikimedia.org/r/280652 (https://phabricator.wikimedia.org/T126785) [15:39:18] (03PS1) 10Filippo Giunchedi: prometheus: add nginx reverse proxy [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) [15:40:16] thcipriani: sync CommonSettings.php first, and then InitialiseSettings.php, will that work? [15:41:07] thcipriani: or reverse? [15:41:26] thcipriani: first patch, second patch's InitialiseSettings and then CommonSettings. [15:41:35] kart_: my worry is that once I sync the first patch it'll spam the logs about wmgULSCompactLanguageLinksBetaFeature until I sync the other patch. If I do them both at the same time then it's the same as the first patch. [15:42:06] if you could do all the variable adding in the first patch, and all the variable removing in the 2nd I could get it sync'd out I think. 
[15:42:21] thcipriani: let me try again. [15:42:30] kart_: thanks :) [15:43:52] thcipriani: Sorry, but wasn't that what I'm doing? [15:44:15] See first patch only adds new vars, including keeping the old one. [15:45:03] kart_: yeah but the second patch is what adds wmgULSCompactLanguageLinksBetaFeature to InitialiseSettings.php; that needs to happen in the first patch [15:45:41] OK. Got it. [15:45:49] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/290455 (owner: 10Filippo Giunchedi) [15:49:23] (03PS7) 10KartikMistry: Enable Compact Language Links as default in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290165 (https://phabricator.wikimedia.org/T134966) [15:49:52] thcipriani: ^ [15:51:04] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw old mw app server decomission - https://phabricator.wikimedia.org/T135468#2323244 (10Joe) @papaul mw2001-2016 and mw2018-mw2060 are turned off and effectively decommissioned. Please take care of not turning off/unrack mw2017 as it is actively used as a deb... 
[15:51:20] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw old mw app server decomission - https://phabricator.wikimedia.org/T135468#2323245 (10Joe) a:05Joe>03Papaul [15:52:47] (03PS8) 10Thcipriani: Enable Compact Language Links as default in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290165 (https://phabricator.wikimedia.org/T134966) (owner: 10KartikMistry) [15:52:52] (03PS1) 10EBernhardson: Change elasticsearch disk critical from 15% to 13% [puppet] - 10https://gerrit.wikimedia.org/r/290481 [15:53:00] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290165 (https://phabricator.wikimedia.org/T134966) (owner: 10KartikMistry) [15:53:06] kart_: thanks :) [15:53:10] (03CR) 10jenkins-bot: [V: 04-1] prometheus: add nginx reverse proxy [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [15:53:44] (03Merged) 10jenkins-bot: Enable Compact Language Links as default in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290165 (https://phabricator.wikimedia.org/T134966) (owner: 10KartikMistry) [15:54:16] I hate rebase, but lets see how it goes. [15:55:45] 06Operations, 07Graphite: investigate carbon-c-relay stalls/drops towards graphite2002 - https://phabricator.wikimedia.org/T135385#2323277 (10fgiunchedi) with the cassandra-metrics-collector changes deployed I haven't seen yet a reoccurence of queues full and drops/stalls, though there seem to be some very low... 
[15:57:03] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:290165|Enable Compact Language Links as default in Beta]] PART I (duration: 00m 23s) [15:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:57:13] (03PS2) 10KartikMistry: Enable Compact Language Links as default in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290475 (https://phabricator.wikimedia.org/T134966) [15:57:25] thcipriani: Second patch: ^ [15:57:37] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:290165|Enable Compact Language Links as default in Beta]] PART II (duration: 00m 25s) [15:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:59:21] kart_: looks like you are defining wmgULSCompactLanguageLinksBetaFeature twice where you only need to remove wmgULSCompactLinks here: https://gerrit.wikimedia.org/r/#/c/290475/2/wmf-config/InitialiseSettings.php [15:59:34] thcipriani: oops. checking. [15:59:42] thanks [16:00:05] godog moritzm coreyfloyd: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160524T1600). [16:00:05] coreyfloyd: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:01:22] thcipriani: sorry, rebase mistake. 
[16:01:40] no problem [16:03:28] (03PS3) 10KartikMistry: Enable Compact Language Links as default in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290475 (https://phabricator.wikimedia.org/T134966) [16:04:50] PROBLEM - puppet last run on mw1133 is CRITICAL: CRITICAL: Puppet has 61 failures [16:05:43] !log mwscript deleteEqualMessages.php --wiki fiwikinews (T45917) [16:05:44] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [16:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:06:48] thcipriani: looks good now? [16:06:55] * thcipriani checks [16:07:23] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290475 (https://phabricator.wikimedia.org/T134966) (owner: 10KartikMistry) [16:07:36] kart_: thanks! sorry for the hassle :( [16:07:53] thcipriani: heh. no problem. [16:08:00] (03Merged) 10jenkins-bot: Enable Compact Language Links as default in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290475 (https://phabricator.wikimedia.org/T134966) (owner: 10KartikMistry) [16:10:19] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:290475|Enable Compact Language Links as default in Beta]] PART I (duration: 00m 23s) [16:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:29] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:290475|Enable Compact Language Links as default in Beta]] PART II (duration: 00m 25s) [16:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:41] ^ kart_ sync'd \o/ [16:11:56] yo [16:12:00] thcipriani: thanks a lot [16:13:08] (03PS3) 10Waxmiguel: contint: bump pip 7.0.1 -> 8.1.2 [puppet] - 10https://gerrit.wikimedia.org/r/289639 (owner: 10Hashar) [16:14:20] thcipriani: working as expected. beta feature -> pref in beta. 
[16:14:31] no harm in production :) [16:14:35] kart_: awesome thanks for checking. [16:14:37] :D [16:16:45] well, there is an issue in beta, let me see. [16:16:55] but not related to our deployment. [16:17:21] kart_: unrelated, but are you the person who rebuilt the debian nginx packages to use dynamic module loading? if so the new lua module is an older version than the one that was not dynamically linked... [16:18:14] YuviPanda: I'm not the one, but you can tell me. [16:18:23] YuviPanda: is there any bug report? [16:18:32] kart_: no, I *just* ran into it :D I'll file a bug report later [16:18:42] (03CR) 10EBernhardson: "not sure if this is the right change to make, or if instead the check should require a certain # of events over threshold, or if perhaps t" [puppet] - 10https://gerrit.wikimedia.org/r/290481 (owner: 10EBernhardson) [16:18:46] YuviPanda: reportbug :) [16:18:52] will do :) [16:18:57] YuviPanda: let me have a look. [16:20:21] thcipriani: to be sure: Have we synced the -labs file? [16:20:26] (03PS1) 10Gehel: Increase time before alter for elasticsearch disk space issues [puppet] - 10https://gerrit.wikimedia.org/r/290487 [16:20:57] * thcipriani looks [16:20:58] thcipriani: InitialiseSettings-labs.php [16:21:48] kart_: ugh. no, there is a problem with sshd access to deployment-tin so jenkins has not sync'd. I'm working on it now. we're a bit backed up right now :( [16:22:04] kart_: I can poke you when all has been sync'd [16:22:51] thcipriani: sure. thanks. [16:23:11] thcipriani: leave me or Nikerabbit a msg. I will go to bed in 30 minutes. [16:23:23] kart_: kk, will do. 
[16:24:40] (03PS1) 10Eevans: By-pass graphite-in; Use graphite1003 directly [puppet] - 10https://gerrit.wikimedia.org/r/290488 (https://phabricator.wikimedia.org/T135385) [16:24:41] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 625 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6501546 keys - replication_delay is 625 [16:25:22] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Update tag and racktables for holmium: rename to labservices1002. - https://phabricator.wikimedia.org/T119533#2323437 (10Andrew) Ready to go now. [16:29:48] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, this will require restarting cassandra-metrics-collector IIRC" [puppet] - 10https://gerrit.wikimedia.org/r/290488 (https://phabricator.wikimedia.org/T135385) (owner: 10Eevans) [16:30:06] thcipriani: I just got msg that beta is updated :) [16:30:47] ie beta-mediawiki-config-update-eqiad spams [16:31:18] kart_: glad to hear it :) I think the ssh issue was fixed, I wasn't sure if I still had to futz with jenkins before it would go. 
[16:33:29] (03CR) 10Eevans: "> LGTM, this will require restarting cassandra-metrics-collector IIRC" [puppet] - 10https://gerrit.wikimedia.org/r/290488 (https://phabricator.wikimedia.org/T135385) (owner: 10Eevans) [16:39:24] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] By-pass graphite-in; Use graphite1003 directly [puppet] - 10https://gerrit.wikimedia.org/r/290488 (https://phabricator.wikimedia.org/T135385) (owner: 10Eevans) [16:47:03] (03PS1) 10Mobrovac: service::node: Prepare for scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/290490 [16:51:20] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw old mw app server decomission - https://phabricator.wikimedia.org/T135468#2323508 (10Papaul) starting disk wipe on mw2001-mw2016 and mw2018- mw2060 [16:53:35] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Decommission es2005-es2010 - https://phabricator.wikimedia.org/T134755#2323516 (10jcrespo) a:05RobH>03Papaul [16:53:37] (03CR) 10Mobrovac: "PCC ok - https://puppet-compiler.wmflabs.org/2894/" [puppet] - 10https://gerrit.wikimedia.org/r/290490 (owner: 10Mobrovac) [16:53:58] (03PS1) 10RobH: expanding sc-admins rights and members [puppet] - 10https://gerrit.wikimedia.org/r/290491 (https://phabricator.wikimedia.org/T135548) [16:54:59] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: rename holmium to labservices1002 - https://phabricator.wikimedia.org/T106303#2323525 (10Andrew) [16:55:20] PROBLEM - HHVM rendering on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:56:47] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Decommission es2005-es2010 - https://phabricator.wikimedia.org/T134755#2323529 (10jcrespo) Actually, I do not know if this should be RobH's or Papaul's, you can negotiate that. Have T134755#2276334 and T128057#2095309 in mind. 
[16:57:00] PROBLEM - Apache HTTP on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:57:09] PROBLEM - configured eth on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:57:09] PROBLEM - nutcracker port on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:57:10] PROBLEM - Check size of conntrack table on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:57:30] PROBLEM - DPKG on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:57:50] PROBLEM - HHVM processes on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:57:55] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Decommission es2005-es2010 - https://phabricator.wikimedia.org/T134755#2323533 (10RobH) a:05Papaul>03RobH First I'll take it for general review and update, then re-task to papaul for the onsite wipes. So I'm stealing this back. [16:58:07] jynus: thats mine until i review for decom and check all the software side stuff for removal [16:58:12] its why i had in first place ;D [16:58:20] PROBLEM - RAID on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:58:26] papaul doesnt have root yet so he cannot force all the decom stuff through is all [16:58:30] PROBLEM - puppet last run on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:58:31] PROBLEM - SSH on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:58:33] so eventually, he will. [16:58:41] PROBLEM - Disk space on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:58:49] PROBLEM - dhclient process on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:58:56] so in similar eqiad based items, chris would likely just do it all [16:59:04] 06Operations, 10RESTBase-Cassandra, 06Services, 10cassandra: Cleanup Graphite Cassandra metrics - https://phabricator.wikimedia.org/T132771#2323538 (10Eevans) Where are we on this? Is there more to do? 
[16:59:18] robh- then that is my confusion [16:59:25] Operations, HHVM, Patch-For-Review, User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2323546 (kaldari) [16:59:30] in theory, I did everything except the pending DNS, but a review is welcome [16:59:44] awesome! thanks for doing it all =] [17:00:04] oh, we also have to disable switch ports [17:00:05] yurik gwicke cscott arlolra subbu: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160524T1700). Please do the needful. [17:00:21] yep, will deploy maps [17:00:40] most dbs are just ranges, so there is not much to do on dhcp/install, only the individual ips [17:00:48] (PS7) Dzahn: ircecho: add icinga process monitoring [puppet] - https://gerrit.wikimedia.org/r/290077 (https://phabricator.wikimedia.org/T135948) [17:01:10] jynus: well, you had to remove from puppet db and stuff too right? though i guess you can regex puppet commands [17:01:11] PROBLEM - salt-minion processes on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:01:26] robh, that of course is done [17:01:31] (CR) Dzahn: "i removed the part for the ircd because we already have a check that actually checks the connection, but i am adding this for the irc bot" [puppet] - https://gerrit.wikimedia.org/r/290077 (https://phabricator.wikimedia.org/T135948) (owner: Dzahn) [17:01:45] cool, yeah i'll finish the review and update shortly [17:01:46] mediawiki, puppet, monitoring [17:02:10] install module [17:02:12] so just dns and switch configs, i dont mind easy decoms like this =] [17:02:15] salt keys? [17:02:33] what I do not touch is dns until confirmation [17:02:39] and rackspace and network [17:02:58] yep, i'll pull network and dns shortly and then papaul will wipe them [17:03:04] and pull from rackspace when he unracks them.
[17:03:07] PROBLEM - nutcracker process on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:03:33] one thing for the decommissions that will be pending [17:04:09] you may find references to old servers still on puppet- I will delete a whole module of a deprecated class all at once [17:06:27] if they are out of site.pp but seem to stay in icinga, then puppetstoredconfigclean.rb on the master normally fixes that [17:06:52] no, that is done [17:07:09] uh, they are still in install module it seems, pulling now. [17:07:25] I mean there is an ugly class file referencing "db1001", etc [17:07:26] okay [17:07:45] I will delete the whole module, instead of committing every single one [17:09:04] (PS3) Dzahn: mv files/misc/scripts/Makefile to scap module [puppet] - https://gerrit.wikimedia.org/r/289351 [17:10:07] (PS1) RobH: decommissioning es2005-es2010 [puppet] - https://gerrit.wikimedia.org/r/290495 [17:10:50] I may have missed that by not using the official checklist [17:11:07] and having decommed too many systems in the last days [17:11:08] (CR) Dzahn: [C: 2] "only comments and the Makefile that isn't used in a class" [puppet] - https://gerrit.wikimedia.org/r/289351 (owner: Dzahn) [17:11:12] (CR) Giuseppe Lavagetto: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/290491 (https://phabricator.wikimedia.org/T135548) (owner: RobH) [17:11:36] deploying kartotherian... [17:11:44] jynus: no worries, you handled getting out of icinga and that is the painful part (since it can cause false pages ;) [17:11:55] Operations, HHVM, Patch-For-Review, User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2323599 (matmarex) [17:11:58] ie: forgetting to remove one, turning off, heh.
[17:12:13] (03PS2) 10RobH: decommissioning es2005-es2010 [puppet] - 10https://gerrit.wikimedia.org/r/290495 [17:12:22] (03CR) 10RobH: [C: 032 V: 032] decommissioning es2005-es2010 [puppet] - 10https://gerrit.wikimedia.org/r/290495 (owner: 10RobH) [17:12:25] _joe_: so back to trusty on ocg1003 for now you said hm ? https://gerrit.wikimedia.org/r/#/c/290297/ [17:13:41] <_joe_> mutante: I sadly think we should do that :/ [17:14:02] <_joe_> since neither me nor you have the time to do the right thing [17:14:21] <_joe_> that would be moving ocg to scap3 for the jessie version [17:14:33] _joe_: yea, it's still a bit progress though, since it fixes the partioning and small / [17:14:41] and we know later we can use the puppet module already [17:14:42] <_joe_> mutante: right [17:14:53] <_joe_> yes, your work is not wasted at all [17:14:58] <_joe_> apart from the actual reimaging [17:15:02] (03PS2) 10RobH: expanding sc-admins rights and members [puppet] - 10https://gerrit.wikimedia.org/r/290491 (https://phabricator.wikimedia.org/T135548) [17:15:05] yep [17:15:56] (03PS1) 10Rush: labstore move standard below role assign [puppet] - 10https://gerrit.wikimedia.org/r/290498 [17:16:21] (03PS3) 10Dzahn: ocg: ocg1003 back to trusty installer [puppet] - 10https://gerrit.wikimedia.org/r/290297 (https://phabricator.wikimedia.org/T84723) [17:16:35] (03CR) 10RobH: [C: 032] "approved in operations meeting" [puppet] - 10https://gerrit.wikimedia.org/r/290491 (https://phabricator.wikimedia.org/T135548) (owner: 10RobH) [17:17:11] (03PS2) 10Rush: labstore move standard below role assign [puppet] - 10https://gerrit.wikimedia.org/r/290498 [17:18:14] 06Operations, 10Ops-Access-Requests, 06Services, 13Patch-For-Review: Expand sc-admins to provide sufficient coverage for sc* clusters - https://phabricator.wikimedia.org/T135548#2323627 (10RobH) 05Open>03Resolved I got the service names for inclusion from @mobrovac (as it seems Gabriel is out today, an... 
[17:18:27] (03PS3) 10Rush: labstore move standard below role assign [puppet] - 10https://gerrit.wikimedia.org/r/290498 [17:18:40] mobrovac: so i just merged those rights for sc-admins [17:18:48] im going to watch puppet run on sca1001 just out of paranoia [17:19:11] it said no issues when i ran test but meh, im paranoid. [17:19:45] ok good, no issues [17:20:08] urandom: ^ [17:20:13] you now have sc-admin rights [17:20:23] (03PS4) 10Rush: labstore move standard below role assign [puppet] - 10https://gerrit.wikimedia.org/r/290498 [17:21:04] Pchelolo: ^ same =] [17:21:27] RECOVERY - HHVM processes on mw1134 is OK: PROCS OK: 6 processes with command name hhvm [17:21:36] RECOVERY - salt-minion processes on mw1134 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:22:07] RECOVERY - nutcracker port on mw1134 is OK: TCP OK - 0.000 second response time on port 11212 [17:22:07] RECOVERY - configured eth on mw1134 is OK: OK - interfaces up [17:22:18] RECOVERY - RAID on mw1134 is OK: OK: no RAID installed [17:22:27] RECOVERY - Check size of conntrack table on mw1134 is OK: OK: nf_conntrack is 0 % full [17:22:37] RECOVERY - SSH on mw1134 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [17:22:56] RECOVERY - DPKG on mw1134 is OK: All packages OK [17:22:57] !log deployed & restarted kartotherian & tilerator. https://gerrit.wikimedia.org/r/#/c/290494/ https://gerrit.wikimedia.org/r/#/c/290497/ [17:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:24:32] (03CR) 10Rush: [C: 032] labstore move standard below role assign [puppet] - 10https://gerrit.wikimedia.org/r/290498 (owner: 10Rush) [17:24:42] robh: yes, I can login to sca1001 now. Thank you! (not to scb yet, but should just wait a bit for a puppet run) [17:25:18] yeah i only manually ran it on the first [17:26:18] i have yet to have the test compiler say its fine and then fail in production, but itll happen someday. 
i tend to force run merges on a single affected host out of paranoia. [17:27:47] PROBLEM - salt-minion processes on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:27:52] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: es2010 controller issue - https://phabricator.wikimedia.org/T127769#2323680 (Papaul) Open>Resolved closing this task since we are decommissioning this in T129452 [17:27:59] robh: in our case paranoia is a good thing :) [17:28:24] Indeed, we consider it a character trait, not a flaw. [17:28:27] PROBLEM - nutcracker port on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:28:27] PROBLEM - configured eth on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:28:30] Operations, ops-codfw, DC-Ops: es2004 doesn't come back up after reboot - https://phabricator.wikimedia.org/T126203#2323685 (Papaul) Open>Resolved closing this task since we are decommissioning this in T129452 [17:28:37] PROBLEM - RAID on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:28:46] PROBLEM - Check size of conntrack table on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:28:57] PROBLEM - SSH on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:29:08] PROBLEM - DPKG on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:32:06] PROBLEM - HHVM processes on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:34:24] robh: kk [17:36:10] mmmm is mw1134 api?
Seeing from serverboard that memory utilization is full [17:36:28] not able to ssh, will try root login and then soft reboot [17:38:42] !log soft rebooted mw1134 due to unresponsiveness (no root login, no ssh login, memory exhaustion from server-board) [17:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:40:49] RECOVERY - nutcracker port on mw1134 is OK: TCP OK - 0.000 second response time on port 11212 [17:40:49] RECOVERY - configured eth on mw1134 is OK: OK - interfaces up [17:41:06] RECOVERY - RAID on mw1134 is OK: OK: no RAID installed [17:41:07] RECOVERY - Check size of conntrack table on mw1134 is OK: OK: nf_conntrack is 0 % full [17:41:14] 06Operations, 10ops-codfw: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2323773 (10Papaul) [17:41:17] RECOVERY - SSH on mw1134 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [17:41:36] RECOVERY - DPKG on mw1134 is OK: All packages OK [17:41:37] RECOVERY - nutcracker process on mw1134 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [17:41:38] RECOVERY - puppet last run on mw1134 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [17:41:47] RECOVERY - Disk space on mw1134 is OK: DISK OK [17:41:48] RECOVERY - dhclient process on mw1134 is OK: PROCS OK: 0 processes with command name dhclient [17:42:06] RECOVERY - HHVM processes on mw1134 is OK: PROCS OK: 6 processes with command name hhvm [17:42:07] RECOVERY - salt-minion processes on mw1134 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:42:27] RECOVERY - HHVM rendering on mw1134 is OK: HTTP OK: HTTP/1.1 200 OK - 66991 bytes in 0.846 second response time [17:42:28] RECOVERY - Apache HTTP on mw1134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.310 second response time [17:43:19] this is weird, I've only seen this behavior for jobrunners [17:45:48] it has happened 
like every once in a while with a random appserver too [17:49:28] ah thanks mutante :) I was checking https://phabricator.wikimedia.org/T122069 [17:49:43] (PS4) Nuria: Ensure we pull latest on analytics.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/290284 (https://phabricator.wikimedia.org/T134506) [17:50:53] (PS5) Ottomata: Ensure we pull latest on analytics.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/290284 (https://phabricator.wikimedia.org/T134506) (owner: Nuria) [17:51:01] (CR) Ottomata: [C: 2 V: 2] Ensure we pull latest on analytics.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/290284 (https://phabricator.wikimedia.org/T134506) (owner: Nuria) [17:58:52] (PS1) Dzahn: rcstream: lt silver,labtestweb connect to redis [puppet] - https://gerrit.wikimedia.org/r/290504 [17:59:19] (PS2) Dzahn: rcstream: fix stream for wikitech, let silver connect to redis [puppet] - https://gerrit.wikimedia.org/r/290504 [18:00:37] (PS1) Eevans: enable restbase2006-b.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/290505 (https://phabricator.wikimedia.org/T95253) [18:01:30] elukey: you aren't still around by any chance, are you? [18:02:57] (CR) Aude: "this would be very helpful :)" [puppet] - https://gerrit.wikimedia.org/r/290504 (owner: Dzahn) [18:03:31] could i convince someone with +2 on puppet to merge: https://gerrit.wikimedia.org/r/290505 ? [18:04:02] it'll start a bootstrap of restbase2006-b.codfw.wmnet, totally safe [18:04:35] (PS3) Dzahn: rcstream: fix stream for wikitech, let silver connect to redis [puppet] - https://gerrit.wikimedia.org/r/290504 [18:04:44] normally godog hooks me up, but we're trying to keep these going on a more or less continuous basis until it's done, and for some reason he likes to eat and sleep [18:05:27] urandom: yes, can do [18:05:45] mutante: thanks!
[18:05:50] (03PS2) 10Dzahn: enable restbase2006-b.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/290505 (https://phabricator.wikimedia.org/T95253) (owner: 10Eevans) [18:06:20] (03CR) 10Dzahn: [C: 032] enable restbase2006-b.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/290505 (https://phabricator.wikimedia.org/T95253) (owner: 10Eevans) [18:07:50] urandom: and now it has effect on the master [18:07:58] mutante: awesome, thanks again! [18:08:30] !log Starting bootstrap of restbase2006-b.codfw.wmnet : T95253 [18:08:31] T95253: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253 [18:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:08:54] twentyafterfour: you're deploying the train today? can you wait with the branch cut until https://gerrit.wikimedia.org/r/290506 merges? [18:09:25] MatmaRex: I just started the branch cut, [18:09:37] * twentyafterfour types ctrl-z [18:09:57] twentyafterfour: did https://gerrit.wikimedia.org/r/#/c/289547/ make it? (the change i'm reverting, merged an hour ago) [18:10:47] (03PS4) 10Dzahn: rcstream: fix stream for wikitech, let silver connect to redis [puppet] - 10https://gerrit.wikimedia.org/r/290504 [18:11:43] MatmaRex: core won't start branching until after the submodules so I think the revert will be in the branch [18:15:31] (03PS1) 10Eevans: enable instance restbase1010-c.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/290508 (https://bugzilla.wikimedia.org/134016) [18:18:20] twentyafterfour: ugh, the queue seems to be stuck https://integration.wikimedia.org/zuul/ [18:18:49] https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit/44156/console this looks hopeless [18:19:09] do CI things still break horribly when you force-merge? [18:20:13] I killed it [18:22:13] mutante: could you do one more for me (last one today)? 
- https://gerrit.wikimedia.org/r/#/c/290508/ [18:22:26] (03CR) 10Dzahn: [C: 031] Enable base::firewall in the pool counter role [puppet] - 10https://gerrit.wikimedia.org/r/290405 (owner: 10Muehlenhoff) [18:22:39] mutante: i'd have done them together, but this will need an extra push, and I needed the other running first [18:23:15] MatmaRex: yes afaik [18:23:49] urandom: yep, checked the IP too [18:24:01] (03CR) 10Dzahn: [C: 032] enable instance restbase1010-c.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/290508 (https://bugzilla.wikimedia.org/134016) (owner: 10Eevans) [18:24:28] you can go ahead. done [18:24:41] mutante: great, thank you! [18:24:53] yw, np [18:25:40] (03PS4) 10Dzahn: ocg: ocg1003 back to trusty installer [puppet] - 10https://gerrit.wikimedia.org/r/290297 (https://phabricator.wikimedia.org/T84723) [18:25:45] MatmaRex: as long as that merges in the next few minutes it'll be in the branch, otherwise we can cherry-pick it [18:25:49] (03CR) 10Dzahn: [C: 032] ocg: ocg1003 back to trusty installer [puppet] - 10https://gerrit.wikimedia.org/r/290297 (https://phabricator.wikimedia.org/T84723) (owner: 10Dzahn) [18:26:17] (03CR) 10Dzahn: [V: 032] ocg: ocg1003 back to trusty installer [puppet] - 10https://gerrit.wikimedia.org/r/290297 (https://phabricator.wikimedia.org/T84723) (owner: 10Dzahn) [18:26:29] twentyafterfour: it apparently restarted from scratch [18:27:32] (03CR) 10Dzahn: [C: 031] scap: add labtestwikitech to mediawiki-installation group [puppet] - 10https://gerrit.wikimedia.org/r/290348 (owner: 10Alex Monk) [18:28:04] !log Starting bootstrap of restbase1010-c.eqiad.wmnet : T134016 [18:28:05] T134016: RESTBase Cassandra cluster: Increase instance count from 2 to 3 - https://phabricator.wikimedia.org/T134016 [18:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:32:39] twentyafterfour: it finally merged. made it in time? 
[18:34:20] (03PS7) 10Rush: labstore cleanup and role vs module arrange [puppet] - 10https://gerrit.wikimedia.org/r/289964 [18:38:59] PROBLEM - cassandra-c CQL 10.64.0.116:9042 on restbase1010 is CRITICAL: Connection refused [18:39:19] PROBLEM - cassandra-b CQL 10.192.48.50:9042 on restbase2006 is CRITICAL: Connection refused [18:43:18] (03PS22) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [18:47:04] (03CR) 1020after4: "Ok I think this is ready for production. It's been on beta cluster since yesterday. There are production deployment concerns regarding the" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [18:47:31] (03PS23) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [18:50:59] PROBLEM - puppet last run on mw1137 is CRITICAL: CRITICAL: puppet fail [18:55:09] (03PS8) 10Rush: labstore cleanup and role vs module arrange [puppet] - 10https://gerrit.wikimedia.org/r/289964 [18:55:36] !Log Performing cleanups on restbase2004.codfw.wmnet [18:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:59:07] !Log Performing cleanup on restbase2008-a.codfw.wmnet [18:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160524T1900). Please do the needful. 
[19:02:53] (03CR) 10Rush: [C: 032] labstore cleanup and role vs module arrange [puppet] - 10https://gerrit.wikimedia.org/r/289964 (owner: 10Rush) [19:05:07] !Log Performing cleanup on restbase2003-a.codfw.wmnet [19:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:06:48] !Log Performing cleanup on restbase2001-a.codfw.wmnet [19:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:08:42] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.0.116:9042 on restbase1010 is CRITICAL: Connection refused eevans Node is bootstrapping - The acknowledgement expires at: 2016-05-26 19:08:17. [19:10:10] ACKNOWLEDGEMENT - cassandra-b CQL 10.192.48.50:9042 on restbase2006 is CRITICAL: Connection refused eevans Node is bootstrapping. - The acknowledgement expires at: 2016-05-26 19:09:47. [19:14:13] !log ocg1003, powercycle for reinstall, scheduled downtime [19:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:18:30] RECOVERY - puppet last run on mw1137 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [19:21:09] (03CR) 10Andrew Bogott: [C: 04-1] rcstream: fix stream for wikitech, let silver connect to redis (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/290504 (owner: 10Dzahn) [19:22:39] (03CR) 10Andrew Bogott: [C: 031] labtestweb2001: add IPv6 like on silver [puppet] - 10https://gerrit.wikimedia.org/r/290351 (owner: 10Dzahn) [19:24:16] (03CR) 10Dzahn: rcstream: fix stream for wikitech, let silver connect to redis (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/290504 (owner: 10Dzahn) [19:24:20] (03PS5) 10Dzahn: rcstream: fix stream for wikitech, let silver connect to redis [puppet] - 10https://gerrit.wikimedia.org/r/290504 [19:25:08] (03CR) 10Dzahn: "btw, i copied that from role for lucene, so we might want to adjust that over there too" [puppet] - 10https://gerrit.wikimedia.org/r/290504 (owner: 10Dzahn) [19:25:26] 
(03CR) 10Andrew Bogott: [C: 031] rcstream: fix stream for wikitech, let silver connect to redis [puppet] - 10https://gerrit.wikimedia.org/r/290504 (owner: 10Dzahn) [19:26:22] (03PS6) 10Dzahn: rcstream: fix stream for wikitech, let silver connect to redis [puppet] - 10https://gerrit.wikimedia.org/r/290504 [19:26:51] (03CR) 10Dzahn: [C: 032] "thanks Andrew!" [puppet] - 10https://gerrit.wikimedia.org/r/290504 (owner: 10Dzahn) [19:28:35] (03PS1) 10RobH: decommissioning es2005-es2010 [dns] - 10https://gerrit.wikimedia.org/r/290519 [19:32:07] (03CR) 10Dzahn: "confirmed on rcs1001." [puppet] - 10https://gerrit.wikimedia.org/r/290504 (owner: 10Dzahn) [19:37:05] (03PS3) 10Dzahn: labtestweb2001: add IPv6 like on silver [puppet] - 10https://gerrit.wikimedia.org/r/290351 [19:39:26] (03PS1) 10Rush: labstore clean up hiera application [puppet] - 10https://gerrit.wikimedia.org/r/290522 [19:39:34] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Decommission es2005-es2010 - https://phabricator.wikimedia.org/T134755#2324151 (10RobH) The switch ports for es2005-es2010 have been disabled, but still have the ES labels/descriptions. Once these machines are fully wiped and removed from the rack, the... 
[19:40:01] Operations, ops-codfw, DBA, hardware-requests: Decommission es2005-es2010 - https://phabricator.wikimedia.org/T134755#2324153 (RobH) a:RobH>Papaul [19:40:51] (CR) RobH: [C: 2] decommissioning es2005-es2010 [dns] - https://gerrit.wikimedia.org/r/290519 (owner: RobH) [19:43:42] (CR) Rush: [C: 2] labstore clean up hiera application [puppet] - https://gerrit.wikimedia.org/r/290522 (owner: Rush) [19:44:03] (CR) Dzahn: [C: 2] labtestweb2001: add IPv6 like on silver [puppet] - https://gerrit.wikimedia.org/r/290351 (owner: Dzahn) [19:44:10] (PS4) Dzahn: labtestweb2001: add IPv6 like on silver [puppet] - https://gerrit.wikimedia.org/r/290351 [19:44:56] !log mwscript deleteEqualMessages.php --wiki warwiki (T45917) [19:44:57] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [19:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:45:55] PROBLEM - puppet last run on mw1146 is CRITICAL: CRITICAL: Puppet has 76 failures [19:48:12] !log mw1142 - Could not allocate memory on puppet run, restarted hhvm service, that fixed it [19:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:48:55] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.075 second response time [19:49:56] RECOVERY - HHVM rendering on mw1142 is OK: HTTP OK: HTTP/1.1 200 OK - 66991 bytes in 0.232 second response time [19:50:30] !log mw1133 - Could not allocate memory, restarted hhvm service, ran puppet [19:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:51:25] RECOVERY - puppet last run on mw1142 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:51:26] PROBLEM - puppet last run on mw1137 is CRITICAL: CRITICAL: puppet fail [19:52:56] PROBLEM - HHVM rendering on mw1146 is CRITICAL: CRITICAL - Socket
timeout after 10 seconds [19:53:06] RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [19:53:36] PROBLEM - Apache HTTP on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:08] PROBLEM - SSH on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:21] !log mwscript deleteEqualMessages.php --wiki cywiktionary (T45917) [19:55:22] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [19:55:25] PROBLEM - RAID on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:55:26] PROBLEM - configured eth on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:55:45] PROBLEM - nutcracker port on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:55:46] PROBLEM - Check size of conntrack table on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:55:46] PROBLEM - Apache HTTP on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:56] PROBLEM - nutcracker process on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:56:06] PROBLEM - DPKG on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:56:25] PROBLEM - Disk space on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:56:25] PROBLEM - HHVM processes on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:56:25] PROBLEM - dhclient process on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:56:37] PROBLEM - HHVM rendering on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:15] PROBLEM - RAID on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:57:15] PROBLEM - nutcracker process on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:57:26] PROBLEM - nutcracker port on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:57:55] PROBLEM - SSH on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:58:06] PROBLEM - salt-minion processes on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:58:06] PROBLEM - Check size of conntrack table on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:58:15] PROBLEM - HHVM processes on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:58:26] PROBLEM - Disk space on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:58:35] PROBLEM - dhclient process on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:58:35] PROBLEM - DPKG on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:58:36] PROBLEM - configured eth on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:58:45] PROBLEM - salt-minion processes on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:02:48] RECOVERY - dhclient process on mw1146 is OK: PROCS OK: 0 processes with command name dhclient [20:02:48] RECOVERY - HHVM processes on mw1146 is OK: PROCS OK: 6 processes with command name hhvm [20:03:07] RECOVERY - Check size of conntrack table on mw1146 is OK: OK: nf_conntrack is 0 % full [20:03:18] RECOVERY - nutcracker process on mw1146 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [20:03:18] RECOVERY - salt-minion processes on mw1146 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:03:18] RECOVERY - DPKG on mw1146 is OK: All packages OK [20:03:38] RECOVERY - Disk space on mw1146 is OK: DISK OK [20:04:07] RECOVERY - RAID on mw1146 is OK: OK: no RAID installed [20:04:18] RECOVERY - SSH on mw1146 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [20:04:38] RECOVERY - configured eth on mw1146 is OK: OK - interfaces up [20:04:58] RECOVERY - nutcracker port on mw1146 is OK: TCP OK - 0.000 second response time on port 11212 [20:07:35] !log mwscript deleteEqualMessages.php --wiki diqwiki (T45917) [20:07:36] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [20:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:09:17] (03PS1) 10Rush: labstore ssh hiera parameters to role(s) [puppet] - 10https://gerrit.wikimedia.org/r/290526 [20:11:38] RECOVERY - puppet last run on mw1146 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [20:13:04] (03CR) 10Rush: [C: 032] labstore ssh hiera parameters to role(s) [puppet] - 10https://gerrit.wikimedia.org/r/290526 (owner: 10Rush) [20:24:55] (03PS1) 10Kaldari: Set Tamil projects to use uca-ta collation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290529 (https://phabricator.wikimedia.org/T75453) [20:28:56] (03PS5) 10Dzahn: labtestweb2001: add IPv6 like on silver [puppet] - 
10https://gerrit.wikimedia.org/r/290351 [20:29:25] 06Operations, 10Continuous-Integration-Config, 13Patch-For-Review: Switch CI from jsduck deb package to a gemfile/bundler system - https://phabricator.wikimedia.org/T109005#2324337 (10hashar) [20:31:07] 06Operations, 10Continuous-Integration-Config, 13Patch-For-Review: Switch CI from jsduck deb package to a gemfile/bundler system - https://phabricator.wikimedia.org/T109005#1537429 (10hashar) [20:32:24] (03PS1) 10Dzahn: elasticsearch: use generic names not hostnames in ferm [puppet] - 10https://gerrit.wikimedia.org/r/290531 [20:33:07] PROBLEM - NTP on mw1137 is CRITICAL: NTP CRITICAL: No response from NTP server [20:34:06] !log ocg1003 - reinstalled, replaced puppet cert, salt key..re-added [20:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:34:24] 06Operations, 10Continuous-Integration-Config, 13Patch-For-Review: Switch CI from jsduck deb package to a gemfile/bundler system - https://phabricator.wikimedia.org/T109005#2324357 (10hashar) With T136096 I have introduced a Jenkins job template that does roughly: * gem install jsduck * jsduck While at it... 
[20:36:18] (03PS2) 10Dzahn: elasticsearch: use generic names not hostnames in ferm [puppet] - 10https://gerrit.wikimedia.org/r/290531 [20:36:56] (03CR) 10Dzahn: [C: 032] labtestweb2001: add IPv6 like on silver [puppet] - 10https://gerrit.wikimedia.org/r/290351 (owner: 10Dzahn) [20:37:03] (03PS1) 10Yuvipanda: docker; Enable backports in base jessie repository [puppet] - 10https://gerrit.wikimedia.org/r/290533 [20:37:19] (03PS2) 10Yuvipanda: docker; Enable backports in base jessie repository [puppet] - 10https://gerrit.wikimedia.org/r/290533 [20:39:10] (03CR) 10Yuvipanda: [C: 032] docker; Enable backports in base jessie repository [puppet] - 10https://gerrit.wikimedia.org/r/290533 (owner: 10Yuvipanda) [20:40:27] 06Operations, 06Services, 13Patch-For-Review: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#2324383 (10Dzahn) re-installed ocg1003 with trusty for now so that it can be used until we have a package for jessie All icinga services are green again, incl. ocg itself OK: ocg_job_status 39746... 
[20:40:48] RECOVERY - RAID on mw1137 is OK: OK: no RAID installed [20:40:58] RECOVERY - SSH on mw1137 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [20:41:18] RECOVERY - NTP on mw1137 is OK: NTP OK: Offset -0.01146769524 secs [20:41:19] RECOVERY - nutcracker port on mw1137 is OK: TCP OK - 0.000 second response time on port 11212 [20:41:38] RECOVERY - nutcracker process on mw1137 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [20:41:49] RECOVERY - Check size of conntrack table on mw1137 is OK: OK: nf_conntrack is 0 % full [20:42:07] RECOVERY - Disk space on mw1137 is OK: DISK OK [20:42:07] RECOVERY - salt-minion processes on mw1137 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:42:08] RECOVERY - dhclient process on mw1137 is OK: PROCS OK: 0 processes with command name dhclient [20:42:17] RECOVERY - DPKG on mw1137 is OK: All packages OK [20:42:17] RECOVERY - HHVM processes on mw1137 is OK: PROCS OK: 12 processes with command name hhvm [20:42:18] RECOVERY - configured eth on mw1137 is OK: OK - interfaces up [20:42:52] (03CR) 10Andrew Bogott: [C: 031] elasticsearch: use generic names not hostnames in ferm [puppet] - 10https://gerrit.wikimedia.org/r/290531 (owner: 10Dzahn) [20:42:54] (03CR) 10Dzahn: "this has not been applied yet because puppet is disabled on labtestweb2001 (Reason: 'andrew fiddling with horizon');" [puppet] - 10https://gerrit.wikimedia.org/r/290351 (owner: 10Dzahn) [20:43:18] (03PS5) 10Foks: Adding WMF Support and Safety user groups to meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290366 (https://phabricator.wikimedia.org/T136046) [20:43:38] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6492771 keys - replication_delay is 0 [20:43:48] RECOVERY - puppet last run on mw1137 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [20:45:32] (03CR) 10Jalexander: [C: 031] 
Adding WMF Support and Safety user groups to meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290366 (https://phabricator.wikimedia.org/T136046) (owner: 10Foks) [20:49:46] (03PS1) 10Dzahn: add AAAA record for labtestweb2001 [dns] - 10https://gerrit.wikimedia.org/r/290542 [20:54:32] (03CR) 10ArielGlenn: "Do we need more than base so that e.g. scap3 works? FOr example the old boxes had a ferm rule:" [puppet] - 10https://gerrit.wikimedia.org/r/290422 (owner: 10Muehlenhoff) [20:58:37] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2324493 (10ssastry) >>! In T135176#2306160, @GWicke wrote: > Given that [Parsoid already supports running on service-runner](https://github.com/wikimedia/mediawiki-node-services/blob/... [20:59:50] train deployment is taking longer today because I didn't get the branch cut done early enough [21:04:31] lots of updates on commons [21:08:05] !log mw1137,mw1146 restarted hhvm service [21:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:08:28] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.537 second response time [21:08:48] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 2.239 second response time [21:09:22] (03PS3) 10Dzahn: elasticsearch: use generic names not hostnames in ferm [puppet] - 10https://gerrit.wikimedia.org/r/290531 [21:09:38] RECOVERY - HHVM rendering on mw1137 is OK: HTTP OK: HTTP/1.1 200 OK - 67567 bytes in 0.530 second response time [21:10:28] RECOVERY - HHVM rendering on mw1146 is OK: HTTP OK: HTTP/1.1 200 OK - 67567 bytes in 0.489 second response time [21:13:17] !log mwscript deleteEqualMessages.php --wiki mrwiki (T45917) [21:13:18] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [21:13:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:14:19] (03CR) 10Dzahn: [C: 032] elasticsearch: use generic names not hostnames in ferm [puppet] - 10https://gerrit.wikimedia.org/r/290531 (owner: 10Dzahn) [21:19:16] jynus: there are more inserts now... [21:19:27] (03CR) 10Dzahn: "confirmed on elastic1001" [puppet] - 10https://gerrit.wikimedia.org/r/290531 (owner: 10Dzahn) [21:19:57] and db1056 is lagging behind, I'm taking a look [21:20:18] (03CR) 10Dzahn: "eh, i mean, it did update the ferm rules:" [puppet] - 10https://gerrit.wikimedia.org/r/290531 (owner: 10Dzahn) [21:24:48] !log set innodb_flush_log_at_trx_commit=0 on db1056 that is lagging behind as a temporary measure [21:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:31:23] (03CR) 10Luke081515: [C: 031] Adding WMF Support and Safety user groups to meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290366 (https://phabricator.wikimedia.org/T136046) (owner: 10Foks) [21:34:44] (03PS1) 10Dzahn: irecho: add systemd require/after to start after ircd [puppet] - 10https://gerrit.wikimedia.org/r/290588 (https://phabricator.wikimedia.org/T134875) [21:35:29] (03PS2) 10Dzahn: irecho: add systemd require/after to start after ircd [puppet] - 10https://gerrit.wikimedia.org/r/290588 (https://phabricator.wikimedia.org/T134875) [21:35:53] (03PS3) 10Dzahn: ircecho: add systemd require/after to start after ircd [puppet] - 10https://gerrit.wikimedia.org/r/290588 (https://phabricator.wikimedia.org/T134875) [21:38:43] (03PS1) 1020after4: group0 to 1.28.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290590 [21:39:59] !log twentyafterfour@tin Started scap: sync testwiki to wmf/1.28.0-wmf.3 [21:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:42:30] 06Operations, 10ops-eqiad, 10DBA: db1056 BBU Failed - https://phabricator.wikimedia.org/T136136#2324634 (10Volans) [21:53:09] (03PS5) 10Krinkle: Convert mwgrep 
to use regexp by default [puppet] - 10https://gerrit.wikimedia.org/r/283107 (owner: 10EBernhardson) [21:53:18] 06Operations, 10Ops-Access-Requests: Requesting access to restricted and analytics-privatedata-users for Joe Sutherland (foks) - https://phabricator.wikimedia.org/T136137#2324659 (10Jalexander) [21:54:56] (03CR) 10Krinkle: [C: 04-1] Convert mwgrep to use regexp by default (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/283107 (owner: 10EBernhardson) [21:54:58] PROBLEM - puppet last run on mw1136 is CRITICAL: CRITICAL: Puppet has 19 failures [21:59:28] PROBLEM - Apache HTTP on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:00:19] PROBLEM - puppet last run on mw1139 is CRITICAL: CRITICAL: Puppet has 1 failures [22:01:37] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.155 second response time [22:07:57] PROBLEM - Apache HTTP on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:08:19] PROBLEM - HHVM rendering on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:08:57] PROBLEM - salt-minion processes on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:09:07] PROBLEM - HHVM rendering on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:09:07] PROBLEM - Disk space on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:09:08] PROBLEM - Apache HTTP on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:09:18] PROBLEM - RAID on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:09:47] PROBLEM - SSH on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:09:48] PROBLEM - dhclient process on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:08] PROBLEM - nutcracker port on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:10:18] PROBLEM - Check size of conntrack table on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:18] PROBLEM - HHVM rendering on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:10:27] PROBLEM - nutcracker process on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:28] PROBLEM - configured eth on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:34] 06Operations, 10Ops-Access-Requests: Requesting access to restricted and analytics-privatedata-users for Joe Sutherland (foks) - https://phabricator.wikimedia.org/T136137#2324715 (10Mdennis-WMF) Approved. [22:10:37] PROBLEM - DPKG on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:47] PROBLEM - HHVM processes on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:57] PROBLEM - puppet last run on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:59] RECOVERY - HHVM rendering on mw1139 is OK: HTTP OK: HTTP/1.1 200 OK - 67554 bytes in 0.389 second response time [22:11:07] PROBLEM - SSH on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:11:07] PROBLEM - configured eth on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:11:18] PROBLEM - dhclient process on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:11:37] PROBLEM - Check size of conntrack table on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:11:49] PROBLEM - salt-minion processes on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:11:57] PROBLEM - HHVM processes on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:11:57] PROBLEM - nutcracker process on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:12:06] uhh [22:12:08] PROBLEM - nutcracker port on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:12:28] PROBLEM - Disk space on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:12:47] PROBLEM - RAID on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:12:48] PROBLEM - DPKG on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:13:07] RECOVERY - Disk space on mw1143 is OK: DISK OK [22:15:02] 06Operations, 10Ops-Access-Requests: Requesting access to restricted and analytics-privatedata-users for Joe Sutherland (foks) - https://phabricator.wikimedia.org/T136137#2324746 (10RobH) a:03RobH [22:15:43] hrmm, looks like two mw systems are being problematic? [22:16:27] RECOVERY - Disk space on mw1136 is OK: DISK OK [22:16:47] twentyafterfour: i suppose something was already wrong with these, and scap/touching them just pushed them over ;] [22:16:48] RECOVERY - RAID on mw1136 is OK: OK: no RAID installed [22:16:49] RECOVERY - DPKG on mw1136 is OK: All packages OK [22:17:08] RECOVERY - SSH on mw1136 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [22:17:09] RECOVERY - configured eth on mw1136 is OK: OK - interfaces up [22:17:18] RECOVERY - dhclient process on mw1136 is OK: PROCS OK: 0 processes with command name dhclient [22:17:25] robh: yeah that's what I was thinking as well, [22:17:38] no idea how to deal with something like that [22:17:38] RECOVERY - Check size of conntrack table on mw1136 is OK: OK: nf_conntrack is 0 % full [22:17:57] RECOVERY - salt-minion processes on mw1136 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:17:58] RECOVERY - HHVM processes on mw1136 is OK: PROCS OK: 6 processes with command name hhvm [22:17:58] RECOVERY - nutcracker process on mw1136 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [22:18:08] RECOVERY - nutcracker port on mw1136 is OK: TCP OK - 0.000 second response time on port 11212 [22:18:10] 2 nodes hung on rebuild cdb files, probably the same 2 [22:18:27] yep, i can see the rebuild
commands just sitting as processes [22:18:27] RECOVERY - HHVM rendering on mw1136 is OK: HTTP OK: HTTP/1.1 200 OK - 67556 bytes in 7.593 second response time [22:18:37] HHVM ooming [22:18:38] RECOVERY - nutcracker process on mw1143 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [22:18:38] Out of memory: Kill process 12681 (hhvm [22:18:53] that's no big surprise :) [22:19:17] PROBLEM - Disk space on mw1143 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [22:19:17] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.085 second response time [22:19:24] volans: so we kill hhvm on these and restart it is all? [22:19:46] what about the rebuild that killed them in the first place (we need to refire it?) [22:19:58] RECOVERY - SSH on mw1143 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [22:20:07] RECOVERY - dhclient process on mw1143 is OK: PROCS OK: 0 processes with command name dhclient [22:20:17] RECOVERY - nutcracker port on mw1143 is OK: TCP OK - 0.000 second response time on port 11212 [22:20:27] RECOVERY - Check size of conntrack table on mw1143 is OK: OK: nf_conntrack is 0 % full [22:20:29] robh: I don't know, just logged and checking logs, I can see there was a spike in load, oom killer killed HHVM [22:20:29] twentyafterfour: me either apparently, so its a learning experience all around ;] [22:20:39] RECOVERY - configured eth on mw1143 is OK: OK - interfaces up [22:20:40] RECOVERY - DPKG on mw1143 is OK: All packages OK [22:20:48] RECOVERY - HHVM processes on mw1143 is OK: PROCS OK: 6 processes with command name hhvm [22:20:52] I was looking on wikitech if there is a checklist to ensure they are ok [22:20:58] RECOVERY - puppet last run on mw1143 is OK: OK: Puppet is currently enabled, last run 34 minutes ago with 0 failures [22:21:07] RECOVERY - salt-minion processes on mw1143 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python 
/usr/bin/salt-minion [22:21:19] RECOVERY - Disk space on mw1143 is OK: DISK OK [22:21:37] RECOVERY - RAID on mw1143 is OK: OK: no RAID installed [22:21:48] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [22:21:48] !log twentyafterfour@tin Finished scap: sync testwiki to wmf/1.28.0-wmf.3 (duration: 41m 48s) [22:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:22:08] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.151 second response time [22:22:15] yeah i dont really find anything for checklisting an apache server for service [22:22:24] they need cdb filed rebuilt [22:22:26] i looked a week ago since we just got a bunch of them in [22:22:28] RECOVERY - HHVM rendering on mw1143 is OK: HTTP OK: HTTP/1.1 200 OK - 67547 bytes in 0.252 second response time [22:22:43] maybe we should just restart the problematic services and fire the rebuild [22:22:48] to ensure any oom issue goes away? [22:22:49] I found this https://wikitech.wikimedia.org/wiki/Application_servers [22:23:09] volans: that predates hhvm [22:23:27] so some of it is accurate, some is not. [22:23:45] like the 'apache setup checklist' isnt [22:23:52] it says to apt install things ;] [22:24:07] great! :) [22:24:31] so yeah, i think we should reboot it and then fire puppet, and then fire the rebuild [22:24:47] cuz they are still sitting showing rebuild and who knows whats up since hhvm crashed out [22:24:47] I'm running scap sync-l10n a second time just to be sure [22:24:58] RECOVERY - puppet last run on mw1139 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [22:24:59] !log twentyafterfour@tin scap sync-l10n completed (1.28.0-wmf.3) (duration: 00m 37s) [22:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:25:11] twentyafterfour: oh, so the second run without issue? 
[22:25:42] (03CR) 1020after4: [C: 032] group0 to 1.28.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290590 (owner: 1020after4) [22:26:08] all the icinga errors cleared out... [22:26:13] maybe reboot isnt needed. [22:26:23] (03Merged) 10jenkins-bot: group0 to 1.28.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290590 (owner: 1020after4) [22:29:14] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.28.0-wmf.3 [22:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:33:02] Hello. [22:33:39] hai [22:33:58] jdlrobson: what do you want to do with https://gerrit.wikimedia.org/r/#/c/290411/ exactly? [22:34:35] cherry-pick it after 290411? [22:35:50] Dereckson: correct. [22:36:01] i couldnt use gerrit interface to do it and was on mobile at time [22:36:52] No problem, I'm preparing it. [22:37:21] Thanks Dereckson ! [22:37:37] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 629 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6495660 keys - replication_delay is 629 [22:37:53] jdlrobson: https://gerrit.wikimedia.org/r/#/c/290598/ [22:38:02] 06Operations, 10Ops-Access-Requests: Requesting access to restricted and analytics-privatedata-users for Joe Sutherland (foks) - https://phabricator.wikimedia.org/T136137#2324852 (10RobH) restricted allows a user to sudo as www-data and apache users, so it technically requires a review in the operations meetin... 
[22:39:38] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6494798 keys - replication_delay is 0 [22:42:25] (03PS1) 10RobH: access request for joe sutherland [puppet] - 10https://gerrit.wikimedia.org/r/290599 (https://phabricator.wikimedia.org/T136137) [22:42:42] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to restricted and analytics-privatedata-users for Joe Sutherland (foks) - https://phabricator.wikimedia.org/T136137#2324870 (10RobH) 05Open>03stalled [22:44:58] Dereckson: great! [22:47:17] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [22:47:50] ok, so we said we want an icinga check that actually looks at content on a wiki [22:47:51] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to restricted and analytics-privatedata-users for Joe Sutherland (foks) - https://phabricator.wikimedia.org/T136137#2324873 (10Jalexander) >>! In T136137#2324852, @RobH wrote: > restricted allows a user to sudo as www-data and apache... [22:47:59] (03CR) 10EBernhardson: "Updated. Also noticed while running this at terbium that the help was calling the argument term, where regex would be more appropriate so " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/283107 (owner: 10EBernhardson) [22:48:08] (03PS6) 10EBernhardson: Convert mwgrep to use regexp by default [puppet] - 10https://gerrit.wikimedia.org/r/283107 [22:48:09] so a string that is in the wiki itself.. but does not change normally [22:48:23] so i suggest "Picture of the Day" on commons [22:48:39] On mira: there is an uncomitted wikiversions.json [22:48:45] Jamesofur: that shouldnt be an issue (merging before june 1st) [22:48:53] robh: thanks! 
[22:48:57] as long as no one objects i'll rebase and merge my patchset on monday =] [22:49:01] Allo [22:49:21] foks: we were just talking about your access request (nothing bad, just has to have ops meeting review as it's sudo related) [22:49:30] * foks nods. [22:49:46] I'll defer that to Jamesofur for the time-being; not sure if there's any rush on it, personally [22:50:00] twentyafterfour: could you look at mira? there is a file matching your last 7eee4f36b913a6b9668e920683ac26a21a08aec2 commit, but not the commit itself [22:50:27] twentyafterfour: perhaps stash it and rebase the branch against origin/master [22:50:33] foks: yeah, I was saying that doing it by our meeting on Wednesday is best but easy to adjust it if need be (and rob was saying that should be no problem) [22:50:42] OK. [22:52:28] jdlrobson: you want to sync includes/MobileFormatter.php in one step or two? [22:54:32] Hi foks, could you get https://gerrit.wikimedia.org/r/#/c/290581/ merged? Ask on #mediawiki-i18n perhaps? [22:54:55] * Jamesofur can merge but isn't sure he's supposed to [22:55:21] I'm ... actually not sure how you do it [22:55:45] * Jamesofur is asking in the -i18n channel [22:55:48] i can do it? [22:55:59] MatmaRex: if you're willing to review that would be great [22:56:14] One step would be good Dereckson [22:56:22] foks: #mediawiki-i18n is the hangout place of i18n/l10n people, they are good custodians for WikimediaMessages [22:56:34] Ah okay [22:57:11] foks: Jamesofur: i +2'd it [22:57:16] MatmaRex: thanks :D [22:57:17] :) [22:58:08] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 685 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6495766 keys - replication_delay is 685 [22:59:14] foks: Jamesofur: Dereckson: i think you'll need to backport that to wmf.2 and wmf.3, otherwise the new messages probably won't appear on the wikis until next week [22:59:31] MatmaRex: l10nupdate won't pick it?
[22:59:38] (i say "probably" because i'm not entirely sure if LocalisationUpdate won't deploy them) [22:59:44] Dereckson: no idea [22:59:50] MatmaRex: my current though was just to create it on meta (where it's being created) temporarily. Do you think that would work or better to backport? [22:59:56] i would guess that it only does translations of messages existing in given branch [23:00:00] but i actually don't know [23:00:02] no from master [23:00:04] RoanKattouw ostriches Krenair MaxSem Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160524T2300). Please do the needful. [23:00:04] foks mlitn kaldari jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:12] Okay, so I can SWAT this evening. Let's start with foks. [23:00:20] go easy on me [23:00:25] :3 [23:00:27] First SWAT? [23:00:30] Welcome so. [23:00:35] Jamesofur: yeah, that would also work, i think [23:00:35] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290366 (https://phabricator.wikimedia.org/T136046) (owner: 10Foks) [23:00:36] yeah, first patch(es) [23:01:41] \o [23:01:47] MatmaRex: Jamesofur: what about run l10nupdate at the end of the SWAT, check if it's correctly picked, and if not create them temporarily on meta? [23:01:57] that sounds good [23:02:07] (03PS6) 10Dereckson: Adding WMF Support and Safety user groups to meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290366 (https://phabricator.wikimedia.org/T136046) (owner: 10Foks) [23:02:14] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290366 (https://phabricator.wikimedia.org/T136046) (owner: 10Foks) [23:02:34] Dereckson: i suppose. 
i really don't know anything about l10nupdate :D [23:02:57] MatmaRex: it's a job backporting l10n files from master to wmf branches [23:04:17] hu ho Zuul [23:04:35] (03CR) 10Dereckson: [C: 032] "SWAtT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290366 (https://phabricator.wikimedia.org/T136046) (owner: 10Foks) [23:05:38] (03PS1) 10Dzahn: add icinga monitoring for content on commons [puppet] - 10https://gerrit.wikimedia.org/r/290606 (https://phabricator.wikimedia.org/T124812) [23:05:50] (03Merged) 10jenkins-bot: Adding WMF Support and Safety user groups to meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290366 (https://phabricator.wikimedia.org/T136046) (owner: 10Foks) [23:05:57] ooc, what is "Zuul"? [23:06:04] * foks n00b [23:06:10] foks: you've the Zuul dashboard on https://integration.wikimedia.org/zuul/ [23:06:24] it's a component picking changes from Gerrit and merging them [23:06:25] oo [23:06:28] after running Jenkins tassks [23:06:43] !add zuul Zuul is a python daemon which acts as a gateway between Gerrit and Jenkins. [23:06:45] So humans can't break the system merging stuff not passing tests, or conflicting. [23:06:56] !zuul Zuul is a python daemon which acts as a gateway between Gerrit and Jenkins. [23:06:57] * Jamesofur looks at dashboard [23:07:02] eh, how did it work.. [23:07:07] wm-bot: help [23:07:23] Hm. [23:07:35] Maybe it's not a "!" but a "@" or something. [23:07:41] * foks just watches. :D [23:07:47] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Adding WMF Support and Safety user groups to meta (T136046) (duration: 00m 26s) [23:07:47] T136046: add wmf-supportsafety user group on meta - https://phabricator.wikimedia.org/T136046 [23:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:08:03] kaldari: ping? 
[23:08:07] foks: please test ^ [23:08:15] !infobot is https://meta.wikimedia.org/wiki/Wm-bot#Infobot [23:08:16] Key was added [23:08:24] !zuul is a python daemon which acts as a gateway between Gerrit and Jenkins. [23:08:25] Key was added [23:08:28] Hi mlitn [23:08:49] foks: so your duty for SWAT is to test patches one merged, and be ready to offer a fix or request a revert if there is any issue [23:08:55] (once merged) [23:09:05] Looks OK to me [23:09:09] hi Dereckson [23:09:12] ?zuul [23:09:13] Thanks for testing. [23:09:15] :) [23:09:18] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [23:09:42] jdlrobson: you're next [23:09:52] oh maybe Zuul hasn't finished [23:09:55] thanks Dereckson [23:10:07] Results (Found 1): zuul, [23:10:07] @search zuul [23:10:24] jdlrobson: yes they're still in the gate-and-submit queue [23:10:26] just !zuul [23:10:55] !zuul [23:10:55] a python daemon which acts as a gateway between Gerrit and Jenkins. [23:10:56] Awesome [23:11:00] mlitn: ping? [23:11:25] here [23:11:58] Dereckson: ok [23:12:04] !jouncebot is a Python reminder bot for deployments. see https://wikitech.wikimedia.org/wiki/Tool:Jouncebot [23:12:05] Key was added [23:12:38] mlitn: fine, I've CR+2 your changes. We've a rather long gate-and-submit queue. Count a good 40 minutes of delay. [23:12:58] sure, thanks [23:13:05] Zuul looks rather stuck. [23:13:10] Oh, yeah. [23:13:12] twentyafterfour: 23:09:19 < icinga-wm> RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [23:13:29] !jenkins is a Java software to assist in building a continuous integration system. https://wikitech.wikimedia.org/wiki/Jenkins [23:13:30] Key was added [23:13:35] !mira is the deployment server in codfw [23:13:35] Key was added [23:13:52] kaldari: ping? [23:15:04] Okay, I'll shepherd 290529.
[23:15:30] (03PS2) 10Dereckson: Set Tamil projects to use uca-ta collation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290529 (https://phabricator.wikimedia.org/T75453) (owner: 10Kaldari) [23:15:37] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:15:38] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290529 (https://phabricator.wikimedia.org/T75453) (owner: 10Kaldari) [23:16:15] Dereckson: I'm here [23:16:17] (03Merged) 10jenkins-bot: Set Tamil projects to use uca-ta collation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290529 (https://phabricator.wikimedia.org/T75453) (owner: 10Kaldari) [23:16:32] Cool. I'm merging the uca-ta change. [23:16:38] thanks! [23:17:28] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.027 second response time [23:20:04] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Set Tamil projects to use uca-ta collation (T75453) (duration: 02m 18s) [23:20:05] T75453: Tamil sort order - https://phabricator.wikimedia.org/T75453 [23:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:20:25] not synced on mw1140.eqiad.wmnet [23:20:48] Kaldari: here you are. You take care of running the update script on Terbium or do you want I run that? ^ [23:20:48] PROBLEM - HHVM rendering on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:21:28] Dereckson: No need to run it... [23:21:37] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:21:44] Dereckson: It's going to run for all the wikis on the 28th. [23:22:35] k [23:22:49] PROBLEM - nutcracker port on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:22:57] PROBLEM - configured eth on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:23:28] PROBLEM - puppet last run on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:23:36] James_F: https://integration.wikimedia.org/ci/job/mediawiki-extensions-php55/4135/console <- not stuck [23:23:37] Dereckson: Sorry, 26th actually. See https://phabricator.wikimedia.org/T129411#2323592. [23:23:38] PROBLEM - Check size of conntrack table on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:23:49] PROBLEM - nutcracker process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:23:57] PROBLEM - DPKG on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:24:12] Dereckson: Not after I killed the build on https://gerrit.wikimedia.org/r/#/c/290371/ :-) [23:24:14] kaldari: yes, so it's not relevant to run it [23:24:25] thanks for taking care of this issue [23:24:28] PROBLEM - RAID on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:24:38] PROBLEM - SSH on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:24:51] Dereckson: Nope, no need to run it, as it will be run in a couple days anyway. [23:26:18] PROBLEM - Disk space on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:26:19] PROBLEM - salt-minion processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:27:10] PROBLEM - dhclient process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:28:09] PROBLEM - HHVM processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:31:00] jdlrobson: 290587 merged, 290598 still pending [23:35:09] (03PS1) 10Yuvipanda: Add base PHP container & php web container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/290607 [23:35:43] jdlrobson: hey https://integration.wikimedia.org/ci/job/mwext-qunit/16611/console <- does that look good to you? [23:39:35] Dereckson: in what sense? [23:39:48] is it normal that it's stuck at the MySQL install step?
[23:40:23] Dereckson: what am i looking at? [23:40:46] Previous build: [23:40:47] 23:19:13 [mwext-qunit] $ /bin/bash -xe /tmp/hudson7132546295699713620.sh [23:40:53] + /srv/deployment/integration/slave-scripts/bin/mw-install-mysql.sh [23:40:55] Your build : [23:40:58] 23:19:13 PHP 5.5.9-1ubuntu4.16 is installed. [23:41:12] Is this the job for https://gerrit.wikimedia.org/r/#/c/290598/ ? [23:41:16] 23:19:19 [mwext-qunit] $ /bin/bash -xe /tmp/hudson3722309187810733348.sh [23:41:19] yep [23:41:30] how long has it been running? [23:41:38] 23 minutes [23:41:47] mwext-qunit SUCCESS in 2m 00s [23:41:53] so yes it's stucked [23:42:03] yeh thats not normal. try submitting again? [23:42:37] RECOVERY - Disk space on mw1140 is OK: DISK OK [23:42:37] RECOVERY - salt-minion processes on mw1140 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:42:38] RECOVERY - HHVM processes on mw1140 is OK: PROCS OK: 6 processes with command name hhvm [23:42:41] (03PS1) 10Yuvipanda: docker: Switch image name to wikimedia-jessie [puppet] - 10https://gerrit.wikimedia.org/r/290608 [23:42:58] RECOVERY - nutcracker port on mw1140 is OK: TCP OK - 0.000 second response time on port 11212 [23:42:58] RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 67546 bytes in 3.584 second response time [23:43:14] (03PS2) 10Yuvipanda: docker: Switch image name to wikimedia-jessie [puppet] - 10https://gerrit.wikimedia.org/r/290608 [23:43:18] RECOVERY - RAID on mw1140 is OK: OK: no RAID installed [23:43:18] RECOVERY - configured eth on mw1140 is OK: OK - interfaces up [23:43:28] RECOVERY - SSH on mw1140 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [23:43:33] I'm doing a scap pull on mw1140, so we can get unmerged kaldari change on it. 
[23:43:37] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.026 second response time
[23:43:40] s/unmerged/undeployed
[23:43:48] RECOVERY - Check size of conntrack table on mw1140 is OK: OK: nf_conntrack is 8 % full
[23:43:57] RECOVERY - dhclient process on mw1140 is OK: PROCS OK: 0 processes with command name dhclient
[23:43:57] RECOVERY - nutcracker process on mw1140 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[23:44:08] RECOVERY - DPKG on mw1140 is OK: All packages OK
[23:44:09] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 59 minutes ago with 0 failures
[23:44:34] (CR) Yuvipanda: [C: 2] docker: Switch image name to wikimedia-jessie [puppet] - https://gerrit.wikimedia.org/r/290608 (owner: Yuvipanda)
[23:45:33] Krinkle: I'm going to be futzing with the way nagf is deployed again today. the code's going to come back to NFS.
[23:46:29] !log scap pull on mw1140 (duration: 02m 42s)
[23:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:50:01] can we add labtestweb wiki to scap list?
[23:50:08] PROBLEM - puppet last run on mw1140 is CRITICAL: CRITICAL: puppet fail
[23:50:23] it should be a copy of wikitech wiki
[23:50:30] mutante: Krenair or andrewbogott would know
[23:50:54] Krenair made the patch :) added andrew to it :)
[23:53:23] Dereckson: what's going on with https://gerrit.wikimedia.org/r/#/c/290598/ ?
[23:53:26] integration.wikimedia.org does it work for you?
[23:53:32] https
[23:53:44] just sync the first patch if there's a chance this isn't going to work
[23:53:47] now it does
[23:53:48] the important thing is to stop the fatal
[23:54:29] jdlrobson: we lost some minutes with Paladox tweaking on Gerrit
[23:54:37] I'm manually merging it
[23:54:42] great
[23:58:02] jdlrobson: syncing it
[23:58:21] !log dereckson@tin Synchronized php-1.28.0-wmf.2/extensions/MobileFrontend/includes/MobileFormatter.php: "Pi" article on mobile en.wp throws a 503 fatal (T135923, [[Gerrit:290587]] + [[Gerrit:290598]]) (duration: 00m 24s)
[23:58:22] T135923: "Pi" article on mobile en.wp throws a 503 fatal - https://phabricator.wikimedia.org/T135923
[23:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:58:45] !log dereckson@tin Synchronized php-1.28.0-wmf.2/extensions/MobileFrontend/tests/phpunit/MobileFormatterTest.php: "Pi" article on mobile en.wp throws a 503 fatal (no-op, [[Gerrit:290587]]) (duration: 00m 23s)
[23:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:58:56] jdlrobson: here you are ^
[23:59:04] Dereckson: awesome
[23:59:06] fix verified!
[23:59:10] :)
[23:59:13] thanks a bunch
[23:59:14] Sorry for the delay.
[23:59:38] Jenkins isn't always very cooperative with emergencies.
[23:59:42] (CR) Dzahn: "http://puppet-compiler.wmflabs.org/2896/" [puppet] - https://gerrit.wikimedia.org/r/290606 (https://phabricator.wikimedia.org/T124812) (owner: Dzahn)