[00:00:04] Cool, it's not generalized. I can repro in a private tab on Chrome and on Firefox for my account. Do both of you have 2FA? I haven't enabled it, that could be a difference.
[00:00:05] (CR) BryanDavis: role::toollabs::merlbot_proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/293223 (https://phabricator.wikimedia.org/T137235) (owner: BryanDavis)
[00:01:33] Dereckson: Yeah, I have two-factor enabled (needed for the new Labs stuff).
[00:01:34] Dereckson: I don't
[00:02:13] Dereckson: do you have an exact timestamp for a login attempt?
[00:03:01] nothing more precise than 23:58 or 23:59 UTC, but I can relogin and check the time.
[00:03:52] (CR) Yuvipanda: role::toollabs::merlbot_proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/293223 (https://phabricator.wikimedia.org/T137235) (owner: BryanDavis)
[00:04:24] tgr: 00:04:14
[00:07:05] Session "{session}" requested with invalid Token cookie.
[00:10:08] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2362757 (Paladox) @mmodell thanks.
[00:10:22] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.008 second response time
[00:11:04] https://doc.wikimedia.org/mediawiki-core/master/php/CookieSessionProvider_8php_source.html a different hash than expected, so
[00:15:08] Dereckson: can you send me your token cookie in private?
[00:15:17] or open an NDA task and put it there
[00:16:12] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.021 second response time
[00:20:11] Task created; I'm checking the token cookie
[00:20:36] Dereckson: never mind, your token is set to invalid
[00:21:00] there should be a better log message for that
[00:21:23] I'll reset it for you; it's too late for me to try to figure out what caused it
[00:29:02] (PS1) Papaul: adding install params for mw2215 to mw2250 Bug: T135466 [puppet] - https://gerrit.wikimedia.org/r/293246
[00:30:52] Dereckson: I've set a token for you manually via SQL
[00:31:06] not sure how long that lasts, but can you try logging in again?
[00:32:05] Works.
[00:32:12] Operations, Discovery, Discovery-Search-Backlog, Elasticsearch, Epic: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089#2362837 (debt)
[00:32:15] Operations, Discovery, Elasticsearch, Discovery-Search-Sprint, Patch-For-Review: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236#2362835 (debt) Open→Resolved Looks like this is resolved - closing.
[00:33:09] Operations, Discovery-Search-Sprint: Followup on elastic1026 blowing up May 9, 21:43-22:14 UTC - https://phabricator.wikimedia.org/T134829#2362846 (debt)
[00:33:10] Dereckson: probably only until your user record gets updated
[00:33:11] Operations, Discovery-Search-Sprint, Patch-For-Review: Check Icinga alert on CirrusSearch response time - https://phabricator.wikimedia.org/T134852#2362844 (debt) Open→Resolved Looks like this is resolved - closing.
[00:33:25] but that should be long enough for me to get some sleep :)
[00:34:04] and for me to write my change to [[Deployments]] :) Good night.
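The manual fix above ("I've set a token for you manually via SQL") can be sketched roughly as follows. This is an illustrative guess, not the exact statement that was run: MediaWiki stores the per-user session token in the `user_token` column of the `user` table, and the hex value here is a placeholder.

```sql
-- Hypothetical sketch of resetting a MediaWiki session token by hand.
-- The 32-hex-digit value is a placeholder; generate a fresh random one.
UPDATE user
SET user_token = '0123456789abcdef0123456789abcdef'
WHERE user_name = 'Dereckson';
```

As noted in the log, anything that rewrites the user row later may clobber a hand-set value, so this is a stopgap rather than a durable fix.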
[00:41:33] (PS2) Papaul: adding install params for mw2215 to mw2250 Bug: T135466 [puppet] - https://gerrit.wikimedia.org/r/293246
[00:46:39] Operations, ops-codfw: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2362908 (Papaul)
[00:50:31] (PS3) Papaul: adding install params for mw2215 to mw2250 Bug: T135466 [puppet] - https://gerrit.wikimedia.org/r/293246
[00:51:28] Operations, ops-codfw: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2362916 (Papaul)
[00:52:05] (CR) Dzahn: [C: +2] adding install params for mw2215 to mw2250 Bug: T135466 [puppet] - https://gerrit.wikimedia.org/r/293246 (owner: Papaul)
[00:55:55] Operations, Patch-For-Review: install font packages on all appservers, not just imagescalers (was: Install fonts-wqy-zenhei on all mediawiki app servers) - https://phabricator.wikimedia.org/T84777#931171 (Cwek) >>! In T84777#2328857, @Dzahn wrote: > @joe If "Timelines aren't rendered on image scalers. Th...
[01:03:03] Operations, Patch-For-Review: install font packages on all appservers, not just imagescalers (was: Install fonts-wqy-zenhei on all mediawiki app servers) - https://phabricator.wikimedia.org/T84777#2362922 (Cwek) In this issue, have we installed the fonts successfully? Or do we need other support, like...
[01:08:07] Operations, Patch-For-Review: install font packages on all appservers, not just imagescalers (was: Install fonts-wqy-zenhei on all mediawiki app servers) - https://phabricator.wikimedia.org/T84777#2362923 (Dzahn) @Cwek The thing is that currently all the fonts are just installed on imagescalers (the serv...
[01:11:14] (CR) Dzahn: [V: +2] "no jenkins-bot" [puppet] - https://gerrit.wikimedia.org/r/293246 (owner: Papaul)
[01:18:08] (PS14) Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063
[01:20:41] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1236.14 seconds
[01:23:05] (CR) BryanDavis: "Example usage via java at https://phabricator.wikimedia.org/P3219" [puppet] - https://gerrit.wikimedia.org/r/293223 (https://phabricator.wikimedia.org/T137235) (owner: BryanDavis)
[01:32:58] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2362934 (Dzahn) I set up a labs environment to test this as much as we like without needing prod; see this: http://...
[01:36:29] (PS3) Dzahn: git.wikimedia.org -> Diffusion redirects [puppet] - https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224)
[01:40:13] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2362935 (Dzahn) This works for the first row in the table: ``` RewriteCond %{HTTP_HOST} =git.wikimedia.org Rew...
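The rule quoted in Dzahn's comment above is truncated, but it follows the usual mod_rewrite pattern for host-based redirects. A minimal sketch of the shape only; the source path and Diffusion target below are illustrative, not the actual mapping from the task's table:

```apache
# Sketch only: redirect one gitblit-style URL on git.wikimedia.org to Diffusion.
RewriteCond %{HTTP_HOST} =git.wikimedia.org
RewriteRule ^/summary/mediawiki/core\.git$ https://phabricator.wikimedia.org/diffusion/MW/ [R=301,L]
```

The `RewriteCond` on `%{HTTP_HOST}` lets the same Apache config serve several hostnames while only rewriting requests that arrived for git.wikimedia.org.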
[01:51:01] (PS3) Yuvipanda: tools: Add role::toollabs::merlbot_proxy [puppet] - https://gerrit.wikimedia.org/r/293223 (https://phabricator.wikimedia.org/T137235) (owner: BryanDavis)
[01:51:13] (PS4) Yuvipanda: tools: Add role::toollabs::merlbot_proxy [puppet] - https://gerrit.wikimedia.org/r/293223 (https://phabricator.wikimedia.org/T137235) (owner: BryanDavis)
[01:51:20] (CR) Yuvipanda: [C: +2 V: +2] tools: Add role::toollabs::merlbot_proxy [puppet] - https://gerrit.wikimedia.org/r/293223 (https://phabricator.wikimedia.org/T137235) (owner: BryanDavis)
[02:00:15] (PS15) Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063
[02:02:44] (PS16) Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063
[02:03:31] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:04:29] (PS17) Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063
[02:05:22] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.018 second response time
[02:05:31] (PS18) Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063
[02:06:06] (PS19) Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063
[02:12:59] (PS20) Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063
[02:14:20] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.26 seconds
[02:19:20] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:21:10] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.015 second response time
[02:22:19] (PS21) Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063
[02:23:12] (PS22) Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063
[02:25:11] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[02:26:42] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[02:27:02] (PS23) Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063
[02:29:01] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.4) (duration: 11m 11s)
[02:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:34:25] (PS24) Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063
[02:35:11] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.012 second response time
[02:35:57] (PS25) Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063
[02:37:11] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 2.733 second response time
[02:37:11] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[02:38:37] (PS26) Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063
[02:40:50] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[02:50:05] eh, any ops around?
[02:50:20] gallium's fs is read-only for some reason
[02:50:25] causing jenkins and CI to be broken
[02:50:31] legoktm@gallium:/var/lib/jenkins$ touch test
[02:50:31] touch: cannot touch `test': Read-only file system
[02:51:48] !log / on gallium is currently read-only for some reason
[02:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:51:57] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.5) (duration: 06m 49s)
[02:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:52:20] yuvipanda: still around?
[02:53:10] hey legoktm
[02:53:13] what happened
[02:53:14] oh
[02:53:22] legoktm: does this completely kill jenkins?
[02:53:39] it's still running, but it can't do anything
[02:53:49] because triggering jobs requires writing to the file system
[02:53:53] legoktm: hmm, I *think* it might be hardware failure
[02:53:59] I see an mdadm alert for it
[02:54:00] well... shit.
[02:54:23] lemme file a bug then
[02:54:26] legoktm: yeah
[02:54:36] legoktm: I don't want to reboot in that state since it might not come up at all
[02:54:46] legoktm: I suppose that'll fuck up more things than it being read-only?
[02:54:59] probably
[02:56:30] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2362975 (Legoktm)
[02:56:39] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2362987 (Legoktm) p: Triage→Unbreak!
[02:56:41] yuvipanda: ^
[02:57:51] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2362989 (yuvipanda) I see an mdadm alert for it: ``` This is an automatically generated mail message from mdadm running on gallium A Fai...
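legoktm confirmed the symptom with `touch`; a non-destructive way to check the same thing is to read the mount options out of /proc/mounts. A small sketch (the `is_readonly` helper is ours, not something from the log):

```shell
# is_readonly MOUNTPOINT [MOUNTS_FILE] -- exit 0 if the mount is listed "ro".
# Reads /proc/mounts by default; field 4 is the comma-separated option list,
# e.g. "ro,relatime" once the kernel has remounted a failing disk read-only.
is_readonly() {
  awk -v mp="$1" 'BEGIN { rc = 1 }
    $2 == mp { n = split($4, o, ","); for (i = 1; i <= n; i++) if (o[i] == "ro") rc = 0 }
    END { exit rc }' "${2:-/proc/mounts}"
}

is_readonly / && echo "/ is read-only" || echo "/ is writable"
```

Unlike `touch test`, this never attempts a write, which matters when you are not yet sure how sick the underlying disk is.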
[02:58:28] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Jun 8 02:58:28 UTC 2016 (duration 6m 31s)
[02:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:58:36] mutante: robh: either of you around?
[02:58:45] legoktm: do you think this is a page-worthy issue now?
[02:58:55] * yuvipanda still doesn't fully understand how serious it is, etc.
[02:59:20] all of CI is broken
[03:00:24] which uh means, patches aren't going to get merged and if people end up deploying stuff, it's going to be without tests and linters :/
[03:00:30] legoktm: is that a 'yes'? if I don't page someone now the europeans will probably show up in like 3h and start taking a look.
[03:01:32] yuvipanda: given that no one noticed in the past ~3h it's been broken, I think waiting another 3h should be okay
[03:01:42] legoktm: ok!
[03:03:01] thanks for looking into it
[03:07:48] legoktm: I'm running an fsck -n just to see what it's going to report
[03:08:01] legoktm: might take a while
[03:09:00] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time
[03:09:45] * legoktm nods
[03:11:01] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 3.872 second response time
[03:11:11] legoktm: mind if I email ops@? or do you want to?
[03:11:27] go for it
[03:11:51] you know more of the details than I do :)
[03:17:07] legoktm: done
[03:21:38] (PS27) Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063
[03:25:18] yuvipanda: thanks. I'm going to go for dinner now, I'll be back in an hour and a half-ish
[03:25:26] legoktm: ok.
[03:31:27] (PS28) Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063
[03:51:10] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2363081 (yuvipanda) fsck completed with: ``` root@gallium:/home/yuvipanda# fsck.ext3 -n /dev/md0 | tee fsck tee: fsck: Read-only file s...
[03:57:30] PROBLEM - puppet last run on mw1178 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:03:38] PROBLEM - Disk space on ms-be2012 is CRITICAL: DISK CRITICAL - free space: / 2007 MB (3% inode=96%): /srv/swift-storage/sdl1 116494 MB (6% inode=91%)
[04:04:48] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.384 second response time
[04:10:58] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2363110 (yuvipanda) I suspect rebooting + fsck on reboot will fix this, but I'm also aware that I haven't done this before, and that gal...
[04:12:29] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 2.305 second response time
[04:22:09] PROBLEM - puppet last run on mw2152 is CRITICAL: CRITICAL: puppet fail
[04:23:28] RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[04:39:02] RECOVERY - Hadoop DataNode on analytics1049 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[04:44:51] PROBLEM - Hadoop DataNode on analytics1049 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[04:48:42] (CR) BryanDavis: [WIP] Kubernetes backend (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063 (owner: Yuvipanda)
[04:49:01] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.863 second response time
[04:50:12] RECOVERY - puppet last run on mw2152 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[04:50:52] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 6.455 second response time
[04:51:00] (PS29) Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063
[04:58:27] (PS15) GWicke: Logstash_checker script for canary deploys [puppet] - https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068)
[04:59:01] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2362975 (MoritzMuehlenhoff) mdadm shows /dev/sda2 as failed, so it needs to be removed from /dev/md0 and replaced. Let's wait for Antoin...
[05:00:09] Thanks moritzm and yuvipanda and legoktm
[05:09:11] RECOVERY - puppet last run on analytics1049 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[05:28:02] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[05:44:41] PROBLEM - puppet last run on analytics1049 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:46:52] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:54:01] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to restricted and analytics-privatedata-users for Joe Sutherland (foks) - https://phabricator.wikimedia.org/T136137#2363169 (Jalexander) Thanks all!
[05:54:33] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.015 second response time
[06:00:42] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:04:52] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 1.077 second response time
[06:09:03] RECOVERY - puppet last run on analytics1049 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:10:41] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.015 second response time
[06:12:41] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 5.700 second response time
[06:16:56] (PS1) Awight: Reinstate Adam Wight's SSH key "snack" [puppet] - https://gerrit.wikimedia.org/r/293259 (https://phabricator.wikimedia.org/T137162)
[06:17:02] Operations, Ops-Access-Requests, Patch-For-Review: New SSH key for AWight - https://phabricator.wikimedia.org/T137162#2363189 (awight) Resolved→Open @faidon Thank you for the quick work! I still use the "snack" key though--following up with patch to reinstate it, in case that's helpful...
[06:17:17] Operations, Ops-Access-Requests, Patch-For-Review: New SSH key for AWight - https://phabricator.wikimedia.org/T137162#2363191 (awight) https://gerrit.wikimedia.org/r/293259
[06:24:19] analytics1049 has a bad disk, working on it
[06:25:30] yeah: org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 11, volumes configured: 12, volumes failed: 1, volume failures tolerated: 0
[06:30:02] PROBLEM - Disk space on ms-be2012 is CRITICAL: DISK CRITICAL - free space: / 2068 MB (3% inode=96%)
[06:31:42] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:01] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail
[06:32:21] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:32] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:34:02] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:22] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:39:32] RECOVERY - Hadoop DataNode on analytics1049 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[06:42:52] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:44:16] (PS2) Muehlenhoff: Enable base::firewall for palladium [puppet] - https://gerrit.wikimedia.org/r/292345
[06:44:41] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[06:46:31] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:47:21] <_joe_> relforge1001??
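The DiskChecker exception quoted above ("volume failures tolerated: 0") is the datanode refusing to run because its tolerance for failed data volumes is zero. The knob involved is `dfs.datanode.failed.volumes.tolerated` in hdfs-site.xml; a sketch of what raising it would look like (whether raising it is the right call here, versus simply replacing the disk, is an operational judgment the log does not settle):

```xml
<!-- Sketch: allow the datanode to keep running with one failed data volume.
     The default is 0, matching "volume failures tolerated: 0" in the log. -->
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
</property>
```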
[06:48:22] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[06:49:10] Chris added that yesterday evening, relevance testing for discovery
[06:50:31] _joe_: new boxes https://phabricator.wikimedia.org/T137256
[06:50:32] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:50:57] <_joe_> p858snake: yeah I figured it out in the meanwhile, thanks anyways :)
[06:52:03] PROBLEM - puppet last run on mw1155 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:52:18] Operations, Performance-Team, Services, Availability: Create restbase BagOStuff subclass - https://phabricator.wikimedia.org/T137272#2363223 (aaron)
[06:52:23] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.011 second response time
[06:54:31] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.026 second response time
[06:56:32] Operations, ops-eqiad, Analytics-Cluster: analyitics1049.eqiad.wmnet disk failure - https://phabricator.wikimedia.org/T137273#2363239 (elukey)
[06:57:43] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:01] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:58:01] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:12] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:22] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:32] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:51] !log enabling ferm on palladium (will lead to temporary puppet failures)
[06:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:00:22] (CR) Muehlenhoff: [C: +2 V: +2] Enable base::firewall for palladium [puppet] - https://gerrit.wikimedia.org/r/292345 (owner: Muehlenhoff)
[07:06:45] PROBLEM - puppet last run on wtp1003 is CRITICAL: CRITICAL: puppet fail
[07:07:15] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.009 second response time
[07:07:34] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: puppet fail
[07:07:44] PROBLEM - puppet last run on ms-be1011 is CRITICAL: CRITICAL: puppet fail
[07:07:54] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: puppet fail
[07:07:55] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: puppet fail
[07:08:05] PROBLEM - puppet last run on ms-be2018 is CRITICAL: CRITICAL: puppet fail
[07:08:05] PROBLEM - puppet last run on mc2003 is CRITICAL: CRITICAL: puppet fail
[07:08:06] PROBLEM - Host palladium is DOWN: PING CRITICAL - Packet loss = 100%
[07:09:35] PROBLEM - puppet last run on ms-fe2002 is CRITICAL: CRITICAL: puppet fail
[07:09:45] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 15 failures
[07:09:45] RECOVERY - Host palladium is UP: PING OK - Packet loss = 0%, RTA = 1.55 ms
[07:09:55] PROBLEM - puppet last run on cp2008 is CRITICAL: CRITICAL: puppet fail
[07:09:56] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: puppet fail
[07:10:04] PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: puppet fail
[07:10:05] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: puppet fail
[07:10:14] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: puppet fail
[07:10:15] PROBLEM - puppet last run on cp2005 is CRITICAL: CRITICAL: puppet fail
[07:10:15] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: puppet fail
[07:10:25] PROBLEM - puppet last run on mw1248 is CRITICAL: CRITICAL: Puppet has 2 failures
[07:10:35] PROBLEM - puppet last run on mc2011 is CRITICAL: CRITICAL: puppet fail
[07:10:44] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: puppet fail
[07:10:45] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: puppet fail
[07:10:55] PROBLEM - puppet last run on wtp1009 is CRITICAL: CRITICAL: puppet fail
[07:10:55] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: puppet fail
[07:10:56] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: puppet fail
[07:11:00] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2362975 (Joe) Please don't reboot the machine: while `/dev/sda2` seems to be failing, we also have `/dev/sdc` reporting I/O errors ```...
[07:11:15] PROBLEM - puppet last run on mc1003 is CRITICAL: CRITICAL: puppet fail
[07:11:15] PROBLEM - puppet last run on mc2010 is CRITICAL: CRITICAL: puppet fail
[07:11:25] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: puppet fail
[07:11:26] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail
[07:11:34] PROBLEM - puppet last run on mc1015 is CRITICAL: CRITICAL: puppet fail
[07:11:35] PROBLEM - puppet last run on restbase1008 is CRITICAL: CRITICAL: puppet fail
[07:11:46] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: puppet fail
[07:12:04] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: puppet fail
[07:12:04] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: puppet fail
[07:12:04] PROBLEM - puppet last run on mw1105 is CRITICAL: CRITICAL: puppet fail
[07:12:05] PROBLEM - puppet last run on mw1006 is CRITICAL: CRITICAL: puppet fail
[07:12:14] PROBLEM - puppet last run on cp2012 is CRITICAL: CRITICAL: puppet fail
[07:12:15] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: puppet fail
[07:12:24] PROBLEM - puppet last run on strontium is CRITICAL: CRITICAL: puppet fail
[07:12:35] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: puppet fail
[07:12:36] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: puppet fail
[07:12:45] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: puppet fail
[07:13:05] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CRITICAL: puppet fail
[07:13:14] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 7.380 second response time
[07:13:15] PROBLEM - puppet last run on cp1063 is CRITICAL: CRITICAL: puppet fail
[07:13:16] PROBLEM - puppet last run on ms-be2005 is CRITICAL: CRITICAL: puppet fail
[07:13:25] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: puppet fail
[07:13:35] PROBLEM - puppet last run on mc1005 is CRITICAL: CRITICAL: puppet fail
[07:13:55] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: puppet fail
[07:13:56] PROBLEM - puppet last run on cp1046 is CRITICAL: CRITICAL: puppet fail
[07:14:04] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: puppet fail
[07:14:05] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet has 47 failures
[07:14:14] PROBLEM - puppet last run on cp2020 is CRITICAL: CRITICAL: puppet fail
[07:14:15] PROBLEM - puppet last run on ms-be2001 is CRITICAL: CRITICAL: puppet fail
[07:14:24] PROBLEM - puppet last run on cp1050 is CRITICAL: CRITICAL: puppet fail
[07:14:34] PROBLEM - puppet last run on cp1074 is CRITICAL: CRITICAL: puppet fail
[07:14:35] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: puppet fail
[07:14:35] PROBLEM - puppet last run on ms-be2019 is CRITICAL: CRITICAL: puppet fail
[07:14:54] PROBLEM - puppet last run on ms-be1012 is CRITICAL: CRITICAL: puppet fail
[07:15:14] PROBLEM - puppetmaster https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:16:05] (PS1) Muehlenhoff: Disable ferm on palladium again [puppet] - https://gerrit.wikimedia.org/r/293261
[07:16:27] (CR) Muehlenhoff: [C: +2 V: +2] Disable ferm on palladium again [puppet] - https://gerrit.wikimedia.org/r/293261 (owner: Muehlenhoff)
[07:17:05] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 378 bytes in 2.647 second response time
[07:18:24] PROBLEM - puppet last run on db2063 is CRITICAL: CRITICAL: Puppet has 3 failures
[07:18:25] PROBLEM - puppet last run on notebook1001 is CRITICAL: CRITICAL: Puppet has 3 failures
[07:18:54] PROBLEM - puppet last run on cerium is CRITICAL: CRITICAL: Puppet has 3 failures
[07:19:24] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Puppet has 12 failures
[07:20:06] PROBLEM - puppet last run on ms-be1001 is CRITICAL: CRITICAL: Puppet has 43 failures
[07:21:04] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:21:14] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:21:15] PROBLEM - puppet last run on mw2122 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:21:35] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Puppet has 3 failures
[07:21:45] PROBLEM - puppet last run on wtp2006 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:22:06] PROBLEM - puppet last run on analytics1054 is CRITICAL: CRITICAL: Puppet has 2 failures
[07:22:14] PROBLEM - puppet last run on mw1023 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:22:14] PROBLEM - puppet last run on mw1005 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:22:15] PROBLEM - puppet last run on elastic1002 is CRITICAL: CRITICAL: Puppet has 2 failures
[07:22:25] PROBLEM - puppet last run on db1030 is CRITICAL: CRITICAL: Puppet has 3 failures
[07:23:05] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 1.698 second response time
[07:23:15] PROBLEM - puppet last run on rcs1001 is CRITICAL: CRITICAL: Puppet has 3 failures
[07:23:15] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:23:34] PROBLEM - puppet last run on ocg1003 is CRITICAL: CRITICAL: Puppet has 3 failures
[07:24:45] PROBLEM - puppet last run on mw2169 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:25:26] RECOVERY - puppet last run on mw1155 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:25:35] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[07:26:24] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Puppet has 4 failures
[07:26:35] PROBLEM - puppet last run on mw2155 is CRITICAL: CRITICAL: Puppet has 3 failures
[07:26:45] PROBLEM - puppet last run on mw2124 is CRITICAL: CRITICAL: Puppet has 3 failures
[07:26:56] PROBLEM - puppet last run on mw1239 is CRITICAL: CRITICAL: Puppet has 3 failures
[07:26:56] PROBLEM - puppet last run on mw2175 is CRITICAL: CRITICAL: Puppet has 3 failures
[07:27:05] PROBLEM - puppet last run on mw2187 is CRITICAL: CRITICAL: Puppet has 2 failures
[07:27:15] PROBLEM - puppet last run on mw1012 is CRITICAL: CRITICAL: Puppet has 3 failures
[07:27:24] PROBLEM - puppet last run on mw2108 is CRITICAL: CRITICAL: Puppet has 3 failures
[07:27:24] PROBLEM - puppet last run on mw2074 is CRITICAL: CRITICAL: Puppet has 3 failures
[07:27:34] PROBLEM - puppet last run on mw1108 is CRITICAL: CRITICAL: Puppet has 3 failures
[07:27:35] PROBLEM - puppet last run on mw2071 is CRITICAL: CRITICAL: Puppet has 4 failures
[07:27:44] PROBLEM - puppet last run on mw1186 is CRITICAL: CRITICAL: Puppet has 3 failures
[07:27:54] PROBLEM - puppet last run on mw2205 is CRITICAL: CRITICAL: Puppet has 3 failures
[07:27:55] PROBLEM - puppet last run on mw1167 is CRITICAL: CRITICAL: Puppet has 3 failures
[07:28:05] PROBLEM - puppet last run on mw2159 is CRITICAL: CRITICAL: Puppet has 2 failures
[07:28:15] PROBLEM - puppet last run on mw2121 is CRITICAL: CRITICAL: Puppet has 3 failures
[07:28:45] PROBLEM - puppet last run on mw1148 is CRITICAL: CRITICAL: Puppet has 3 failures
[07:29:35] RECOVERY - puppet last run on restbase1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:32:04] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:32:06] RECOVERY - puppet last run on mw2205 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:32:45] RECOVERY - puppet last run on ocg1003 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[07:32:46] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[07:32:54] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[07:32:55] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[07:32:55] RECOVERY - puppet last run on mw1005 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[07:33:14] RECOVERY - puppet last run on wtp2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:33:16] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:33:16] RECOVERY - puppet last run on cp1047 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[07:33:24] RECOVERY - puppet last run on db2063 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:33:24] RECOVERY - puppet last run on mc2003 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[07:33:25] RECOVERY - puppet last run on mw1105 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[07:33:25] RECOVERY - puppet last run on mw2169 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:33:25] RECOVERY - puppet last run on rcs1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:33:34] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[07:33:35] RECOVERY - puppet last run on mw2122 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:33:44] RECOVERY - puppet last run on mw1248 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:33:45] RECOVERY - puppet last run on wtp1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:33:45] RECOVERY - puppet last run on db1030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:33:45] RECOVERY - puppet last run on elastic1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:33:54] RECOVERY - puppet last run on ms-fe2002 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[07:33:55] RECOVERY - puppet last run on mw1108 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[07:33:55] RECOVERY - puppet last run on mw2071 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[07:33:56] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2363275 (Joe) ``` mdadm --detail /dev/md0 /dev/md0: Version : 0.90 Creation Time : Thu Aug 25 21:30:22 2011 Raid Level :...
[07:34:05] RECOVERY - puppet last run on mw2121 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [07:34:15] RECOVERY - puppet last run on analytics1054 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:34:15] RECOVERY - puppet last run on mw1186 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:34:26] RECOVERY - puppet last run on mc2010 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [07:34:35] RECOVERY - puppet last run on mw1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:34:35] RECOVERY - puppet last run on mw1023 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:34:35] RECOVERY - puppet last run on mw2074 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:34:45] RECOVERY - puppet last run on wtp1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:34:54] RECOVERY - puppet last run on ms-be1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:34:54] RECOVERY - puppet last run on mc1003 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [07:34:54] RECOVERY - puppet last run on mw1148 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:34:55] RECOVERY - puppet last run on mw1167 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [07:34:55] RECOVERY - puppet last run on mw2159 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:35:14] RECOVERY - puppet last run on mw2124 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:35:15] RECOVERY - puppet last run on mw2155 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:35:15] RECOVERY - puppet last run on mw2187 is OK: OK: Puppet is currently 
enabled, last run 2 minutes ago with 0 failures [07:35:15] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [07:35:25] RECOVERY - puppet last run on strontium is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [07:35:25] RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [07:35:26] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:35:34] RECOVERY - puppet last run on mw2108 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:35:35] RECOVERY - puppet last run on ms-be1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:35:35] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:35:35] RECOVERY - puppet last run on mw2175 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:35:45] RECOVERY - puppet last run on mw1239 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:35:45] RECOVERY - puppet last run on notebook1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:35:54] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:35:54] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:36:04] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:36:06] RECOVERY - puppet last run on cp2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:36:14] RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [07:36:34] RECOVERY - puppet last 
run on cp2008 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [07:36:36] RECOVERY - puppet last run on ms-be2018 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [07:36:44] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:36:45] RECOVERY - puppet last run on mc2011 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [07:37:04] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:37:15] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:37:15] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [07:37:25] RECOVERY - puppet last run on mc1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:37:35] RECOVERY - puppet last run on cp2012 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [07:37:44] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:37:46] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:37:54] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:38:04] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [07:38:25] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:38:36] RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [07:38:36] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 
failures [07:38:44] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [07:38:44] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [07:38:55] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:38:56] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [07:38:56] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:39:14] RECOVERY - puppet last run on cp1063 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [07:39:15] RECOVERY - puppet last run on cp1050 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [07:39:24] RECOVERY - puppet last run on cp2020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:39:44] RECOVERY - puppet last run on mc1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:39:45] RECOVERY - puppet last run on cp1074 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:39:54] RECOVERY - puppet last run on ms-be1016 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [07:40:04] RECOVERY - puppet last run on ms-be1012 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [07:40:04] RECOVERY - puppet last run on ms-be2019 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [07:40:34] RECOVERY - puppet last run on ms-be2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:41:05] RECOVERY - puppet last run on ms-be2005 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [07:47:46] (03PS4) 10Giuseppe Lavagetto: Change-Prop: Enable file 
transclusions updates. [puppet] - 10https://gerrit.wikimedia.org/r/292899 (owner: 10Ppchelko) [07:48:10] <_joe_> mobrovac: I guess we won't wait for CI ;) [07:48:52] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Change-Prop: Enable file transclusions updates. [puppet] - 10https://gerrit.wikimedia.org/r/292899 (owner: 10Ppchelko) [07:49:35] can't even sudo journalctl on gallium [07:49:48] <_joe_> jzerebecki: it's a Precise [07:49:59] <_joe_> how would journalctl work there? [07:49:59] oh [07:50:18] I was misled by sudo -l [07:50:24] <_joe_> jzerebecki: jynus is copying all the relevant data elsewhere [07:50:31] excellent [07:50:43] <_joe_> but I don't know when CI will be back tbh [07:51:02] <_joe_> we tried reaching out to hashar but no luck [07:51:49] <_joe_> if no one gets here in 20 mins, I'll escalate to greg [07:52:34] I could try to give you a hand but I might need to make sure I'm not needed elsewhere [07:53:49] !log change-prop deploying 84d56e53a [07:53:50] <_joe_> jzerebecki: thanks, help is always appreciated ofc [07:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:54:52] _joe_: stopping and starting CP on scb1002 [07:58:25] ACKNOWLEDGEMENT - MD RAID on gallium is CRITICAL: CRITICAL: Active: 1, Working: 1, Failed: 1, Spare: 0 Giuseppe Lavagetto T137265 [07:58:25] ACKNOWLEDGEMENT - puppet last run on gallium is CRITICAL: CRITICAL: Puppet last ran 8 hours ago Giuseppe Lavagetto T137265 [08:04:56] 06Operations, 10Ops-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jan Dittrich - https://phabricator.wikimedia.org/T136560#2363301 (10Jan_Dittrich) Hello @jcrespo, thanks for picking this up. I saw that my LDAP was not connected to my Phabricator account, but now it is (WMDE-jand) [08:07:22] is there a replacement for gallium?
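The two ACKNOWLEDGEMENT lines above report gallium's mirror down to a single working member (Active: 1, Working: 1, Failed: 1). As an illustrative sketch only: a degraded two-member md RAID1 shows "[2/1]" and "[U_]" (or "[_U]") in /proc/mdstat. The sample line below is hypothetical; gallium's real array details are in T137265.

```shell
# Hypothetical /proc/mdstat line for a RAID1 with one failed member;
# "(F)" flags the failed device, "[U_]" means only the first slot is up.
mdstat_line='md0 : active raid1 sda1[0] sdc1[1](F) 488255488 blocks [2/1] [U_]'
case "$mdstat_line" in
  *'[U_]'* | *'[_U]'*) state=degraded ;;
  *) state=healthy ;;
esac
echo "md0 is $state"   # prints: md0 is degraded
```

On a real host the same check would read /proc/mdstat directly, or use `mdadm --detail /dev/md0` as shown earlier in the log.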
[08:09:59] <_joe_> jzerebecki: nope for now jynus is copying data to einsteinium [08:10:21] <_joe_> which is a spare we've recently installed for (IIRC) icinga upgrade/replacement [08:11:57] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2363305 (10Paladox) Thanks :) [08:14:02] !log Jenkins has a bunch of executors dead for whatever reason, preventing jobs from running :( [08:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:14:23] ha ha [08:14:24] <_joe_> hashar: read us [08:14:29] <_joe_> hashar: gallium is dead [08:14:32] oh really [08:14:35] only a "bunch" :-) [08:14:42] <_joe_> hashar: didn't greg send you an email? [08:14:48] probably, I have just connected [08:14:51] <_joe_> https://phabricator.wikimedia.org/T137265 [08:15:06] <_joe_> hashar: we need to rebuild gallium, most probably [08:15:19] so disk is dead ? [08:15:38] !log lowering webrequest_text kafka topic retention time from 7 days to 4 days to free disk space (T136690) [08:15:39] T136690: Kafka 0.9's partitions rebalance causes data log mtime reset messing up with time based log retention - https://phabricator.wikimedia.org/T136690 [08:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:15:58] <_joe_> hashar: 1 disk dead, the other has hardware errors [08:16:11] <_joe_> I don't really know what happened there, but we need to reinstall [08:16:17] given that box had heavy I/O for roughly 5 years that was to be expected [08:16:23] <_joe_> jynus is copying data to a backup on einsteinium [08:16:35] <_joe_> hashar: well we had a raid, but something was not done correctly [08:16:48] not a big deal [08:16:51] need advice on what things are important and what not [08:16:58] let me check quickly [08:17:05] <_joe_> hashar: not a big deal?
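The !log entry above shrinks the webrequest_text topic's retention from 7 to 4 days. A hedged sketch of the arithmetic involved; the commented-out kafka-topics.sh invocation is an assumption about the tooling on the Kafka 0.9 brokers, not the exact command that was run:

```shell
# Topic retention is configured in milliseconds; compute 4 days worth.
retention_ms=$(( 4 * 24 * 3600 * 1000 ))
echo "retention.ms=$retention_ms"
# Assumed per-topic override (Kafka 0.9-era CLI; $ZOOKEEPER is a placeholder):
# kafka-topics.sh --zookeeper "$ZOOKEEPER" --alter \
#   --topic webrequest_text --config retention.ms="$retention_ms"
```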
[08:17:15] I have copied all of /srv except org [08:17:18] in the sense that most of the data there is generated [08:17:23] almost everything is under puppet [08:17:29] the exception being Jenkins itself [08:17:34] and I am now copying /var/lob/jenkins [08:17:45] *lib [08:17:48] which is mostly about build history, and I think we can afford to lose the Jenkins build history [08:18:03] yeah /var/lib/jenkins [08:18:08] though that one is a bit huge, with a ton of files [08:18:19] I see your home has a lot of files [08:18:29] <_joe_> hashar: so, should we just get a spare and reinstall it as a gallium replacement? [08:18:29] please advise if some are skippable [08:18:53] _joe_: yup that is what paravoid has been pushing on [08:18:58] there's an older ticket for setup of cobalt as the gallium replacement, maybe that box can be used? https://phabricator.wikimedia.org/T95959 [08:19:03] I wanted to poke with him to find a good strategy to migrate out of gallium entirely [08:19:18] hashar, what OS? [08:19:23] Jessie [08:19:31] <_joe_> cobalt? [08:19:32] can we do jessie right now? [08:19:43] or is it an aspiration for the future? [08:19:56] Jessie would be fine I believe [08:20:16] gallium "just" has Jenkins / Zuul and an apache vhost to host integration.wm.o and doc.wm.o [08:20:40] <_joe_> hashar: so at least apache will need converting, but yes let's do it [08:20:41] Jenkins is all about copy pasting /var/lib/jenkins + installing the deb package and it should come back.
[08:20:41] what I mean is that stupid things can be huge blockers, believe us [08:20:50] Zuul has a Debian package, though I haven't updated it for Jessie [08:21:24] <_joe_> hashar: what jynus is trying to tell you is that if we go with jessie, it will take probably a couple of days before CI is up [08:21:30] <_joe_> possibly more [08:21:31] seems T95959 has never been updated, cobalt is marked as decommissioned in racktables [08:21:35] T95959: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959 [08:21:36] wanna save /srv/ but /srv/ssd can be skipped entirely [08:21:51] hashar, is /srv/org important? [08:21:58] <_joe_> and btw, I am off tomorrow and Friday [08:22:12] I am off tomorrow too [08:22:19] jynus: potentially we can regenerate part of its data. But if we can take a snapshot that will save a lot of time [08:22:34] hashar, ok, I will copy all /srv [08:22:48] jynus: you can skip /srv/ssd [08:23:04] that is ~ 10 GBytes of git repos we don't need [08:23:11] that is 8GB, it is not useful to skip [08:23:19] if it has less than 30GB, we do not care [08:23:29] we just copy all [08:24:34] moritzm: yeah cobalt was "allocated" a year or so ago. But that was just a placeholder [08:24:53] if we have a spare Jessie floating around with an IP in prod that would do [08:24:54] can be public [08:24:56] err [08:24:59] the IP can be private [08:25:36] <_joe_> hashar: let me take a look at our spares [08:26:15] multatuli is a jessie host which is up and running and unused [08:26:33] moritzm, disk space and configuration?
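The copy plan agreed above (save all of /srv, skip the regenerable /srv/ssd git caches) could be sketched with an rsync exclude. This is a hedged illustration run against a throwaway tree so it is self-contained; the real source/destination hosts (gallium, einsteinium) and any extra flags actually used are not encoded here.

```shell
# Build a tiny stand-in for /srv with one keepable and one skippable subtree.
src=$(mktemp -d); dst=$(mktemp -d)
mkdir -p "$src/org/doc" "$src/ssd/git-cache"
echo 'generated docs' > "$src/org/doc/index.html"

# -a preserves permissions/times, -H preserves hard links;
# the leading slash anchors the exclude at the transfer root.
rsync -aH --exclude='/ssd/' "$src/" "$dst/"

test -e "$dst/org/doc/index.html" && test ! -e "$dst/ssd" && echo 'ssd skipped'
```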
[08:26:43] ah, but it's in Amsterdam, probably better in eqiad [08:26:49] yes [08:27:09] yeah needs eqiad [08:27:22] since the whole CI interacts a lot with Gerrit (eqiad) and OpenStack labs (eqiad) as well [08:27:35] the added network roundtrip latency cripples it :( [08:27:48] or we can go with a Ganeti VM if that is easier to set up [08:28:09] we had gallium as real hardware in prod with a public IP because that is what we did 5 years ago [08:28:12] but the io that we used ssds for? [08:28:29] the ssd is to speed up the zuul-merger , we have one on scandium.eqiad.wmnet [08:28:35] so we don't even need to migrate zuul-merger [08:28:55] <_joe_> I am looking at the spare servers spreadsheet [08:28:57] !log stopping Jenkins / zuul / zuul-merger / puppet on gallium [08:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:29:35] how about wmf4723 or the similar systems? warranty until 2018 and should be fine spec-wise [08:29:43] we can still add the SSD subsequently [08:29:48] the SSDs are not needed [08:29:52] would be fine without it [08:29:59] <_joe_> moritzm: it has 4 tb of disk, which is not needed AFAICT [08:30:22] ~500Gbytes disk + some cpu power and ideally 8 Gbytes of RAM [08:30:37] <_joe_> moritzm: I'd say wmf4746 [08:31:41] _joe_: maybe I'm misreading the spreadsheet, but it has a single disk only? [08:31:49] jynus: would you mind saving /var/lib/zuul as well ? it has a few ssh private keys though they should be in puppet [08:31:57] <_joe_> moritzm: yeah, you're right [08:32:03] <_joe_> it's me not reading it correctly :P [08:32:23] 4579? [08:32:39] hashar, please add to that ticket the list of all things that do and don't need to be saved [08:32:47] not too overpowered and in warranty until 2017 [08:33:00] <_joe_> moritzm: I eyed it, but it wasn't "Potential allocation on T133099"?
[08:33:01] PROBLEM - zuul_gearman_service on gallium is CRITICAL: Connection refused [08:33:03] ah, but it's potentially allocated [08:33:10] that icinga alarm is me [08:33:22] <_joe_> so let's go with wmf4723 [08:33:22] jynus: doing [08:33:33] <_joe_> hoping it's installable [08:33:42] PROBLEM - zuul_merger_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [08:33:48] <_joe_> hashar: so, are you ok with having CI down for 2-3 days [08:33:54] <_joe_> if so, we can move to jessie [08:34:01] <_joe_> else, I strongly advise against it [08:34:02] PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [08:34:04] * legoktm is going to sleep now, o/ [08:34:12] <_joe_> legoktm: thanks a lot [08:34:17] <_joe_> have a good night [08:34:34] the zuul package is already built for jessie, I'm cautiously optimistic it will be less than 2-3 days [08:34:37] legoktm: good night! [08:34:55] <_joe_> moritzm: I am presenting the worst case scenario [08:35:00] well, the copy will take at least 8 hours due to zillions of small files [08:35:01] sure :-) [08:35:01] _joe_: if we get a server installed, I am confident we can get it back sooner [08:36:03] <_joe_> hashar: ok if this is the case, I'll try to see if that spare is indeed usable now [08:36:34] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2363349 (10hashar) Entirely my fault for not having prepared a proper backup of gallium T80385 and not having moved gallium to another hos... [08:36:48] jynus: that is for /var/lib/jenkins ? A lot of that can be purged [08:37:04] Jenkins saves 5 - 8 files per build and we have it keep 15 to 30 days of history. That piles up [08:37:14] yes, but how to tell?
[08:37:41] going over every project and not copying older jobs? [08:37:48] too much work [08:37:52] oh, it is read-only, isn't it? [08:38:03] you probably can do that after the copy [08:38:12] archiving old jobs to another place [08:38:31] so the main place gets fewer files [08:38:41] I thought about speeding up the copy [08:38:42] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2363351 (10Joe) I think our best bet at the moment is installing a new system to replace gallium. @hashar suggested moving to jessie dir... [08:38:42] PROBLEM - jenkins_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [08:39:01] PROBLEM - jenkins_zmq_publisher on gallium is CRITICAL: Connection refused [08:39:18] /var/lib/jenkins/jobs/*/builds/ can be skipped entirely for example. That is most of the files [08:40:01] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:40:08] how many different projects are there? [08:40:50] roughly 300 [08:41:26] and there is probably history for 20 k builds each having 5-8 files + various archived files that can be MBytes [08:41:53] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [08:42:09] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2363358 (10Joe) I don't have rights to edit the spares allocation spreadsheet, so I can't comment there, but I am thinking of allocating `... [08:52:13] Hi can we add a notice on https://phabricator.wikimedia.org/ about CI problems [08:52:14] please [08:52:28] since we're getting tasks like https://phabricator.wikimedia.org/T137276 [08:53:27] paladox: how would I do that?
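The archiving idea discussed above (move old per-job build records out of /var/lib/jenkins so the main copy carries fewer files) could be approximated by selecting build directories past the retention window. This is a hypothetical sketch against a throwaway stand-in directory, not the command actually run, and it relies on GNU `touch -d` / `find -mtime`; nothing is deleted, it only lists candidates.

```shell
# Stand-in for /var/lib/jenkins with one stale and one fresh build record.
jenkins_home=$(mktemp -d)
mkdir -p "$jenkins_home/jobs/demo-job/builds/1" "$jenkins_home/jobs/demo-job/builds/2"
touch -d '40 days ago' "$jenkins_home/jobs/demo-job/builds/1"   # stale build

# List build dirs older than the ~30-day retention ceiling mentioned above.
old_builds=$(find "$jenkins_home"/jobs/*/builds -mindepth 1 -maxdepth 1 -type d -mtime +30)
echo "$old_builds"
```

A real archival pass would follow the listing with a move to the archive location rather than an echo.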
[08:53:50] jzerebecki: I think if you have permission you would need to edit the dashboard. [08:54:01] I will go and get the link [08:54:48] jzerebecki: https://phabricator.wikimedia.org/dashboard/manage/1/ [08:55:39] jzerebecki: I think we can either add a new panel at the right top or edit this https://phabricator.wikimedia.org/W727 [08:55:45] please [08:56:30] I think none of the possibly awake phab admins are here [08:56:39] jzerebecki: Oh [08:57:13] jzerebecki: Krenair [08:57:16] is an admin [09:03:39] 06Operations, 10Phabricator, 06Project-Admins, 06Triagers: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#1596691 (10Sebastian_Berlin-WMSE) I'd like to join #project-admins to be able to manage #wikispeech, primarily to create sprint projects. [09:05:51] (03PS1) 10Elukey: Limit the maximum broker topic log size to 10TB. [puppet] - 10https://gerrit.wikimedia.org/r/293270 (https://phabricator.wikimedia.org/T136690) [09:06:22] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:07:31] 06Operations, 10Phabricator, 06Release-Engineering-Team: Create a notice panel on phabricator homepage - https://phabricator.wikimedia.org/T137278#2363407 (10Paladox) [09:08:03] 06Operations, 10Phabricator, 06Release-Engineering-Team: Create a notice panel on phabricator homepage - https://phabricator.wikimedia.org/T137278#2363392 (10Paladox) @mmodell or @Aklapper would you be able to do it please.
[09:08:18] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Port Zuul package 2.1.0-95-g66c8e52 from Precise to Jessie - https://phabricator.wikimedia.org/T137279#2363409 (10hashar) [09:09:07] andre__: hey can you take a look at T137278 [09:09:07] T137278: Create a notice panel on phabricator homepage - https://phabricator.wikimedia.org/T137278 [09:09:43] 06Operations, 10Phabricator, 06Release-Engineering-Team: Create a notice panel on phabricator homepage - https://phabricator.wikimedia.org/T137278#2363427 (10Paladox) Seems we should create a new project that we lock down to only users can add only users instead of freely allowing joining. But could the rele... [09:09:46] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Port Zuul package 2.1.0-95-g66c8e52 from Precise to Jessie - https://phabricator.wikimedia.org/T137279#2363428 (10hashar) I have spawned a Jessie labs instance `zuul-dev-jessie.integration.eqiad.wmflabs`... [09:10:14] I'll look into T80385 in the mean time, so that we have backups from the start once the new host is live [09:10:21] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [09:15:27] jzerebecki: I'm not really convinced that many people 1) have the default Phabricator dashboard and 2) ever go to the very frontpage of Phabricator [09:16:07] 06Operations, 10Phabricator, 06Release-Engineering-Team: Create a notice panel on phabricator homepage - https://phabricator.wikimedia.org/T137278#2363392 (10Peachey88) If they don't read the relevant mailing lists, what makes you think they will read the front page of phabricator? [09:16:22] andre__: might be. fine by me to not do it. [09:17:30] 06Operations, 10Phabricator, 06Release-Engineering-Team: Create a notice panel on phabricator homepage - https://phabricator.wikimedia.org/T137278#2363461 (10JanZerebecki) I got to it to create a new ticket. 
Is there a way to add a notice above the create ticket form without being able to edit the rest of th... [09:17:38] 06Operations, 10Phabricator, 06Release-Engineering-Team: Create a notice panel on phabricator homepage - https://phabricator.wikimedia.org/T137278#2363462 (10Paladox) Since they have to create a task like T137276 did. [09:18:49] 06Operations, 10Phabricator, 06Release-Engineering-Team: Create a notice panel on phabricator homepage - https://phabricator.wikimedia.org/T137278#2363463 (10Paladox) @JanZerebecki https://phabricator.wikimedia.org/transactions/editengine/maniphest.task/view/1/ [09:28:44] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2363483 (10hashar) I am rebuilding/testing the Zuul deb package for Jessie (T137279). I have created a placeholder incident report on htt... [09:36:52] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2363502 (10Joe) The host I chose was already allocated to maps100*, so we are now targeting `wmf4746` instead. [09:38:43] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:40:41] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [09:41:31] (03PS1) 10Aaron Schulz: Set "sync" filebackend replication to measure latency effect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293272 [09:44:36] (03CR) 10Ottomata: [C: 031] Limit the maximum broker topic log size to 10TB. 
[puppet] - 10https://gerrit.wikimedia.org/r/293270 (https://phabricator.wikimedia.org/T136690) (owner: 10Elukey) [09:49:05] (03PS1) 10Ottomata: Include librdkafka-dev in contint::packages::python [puppet] - 10https://gerrit.wikimedia.org/r/293273 (https://phabricator.wikimedia.org/T133779) [09:58:28] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2363564 (10Joe) smartclt status for both disks: - sdc P3220 - sda P3221 [10:02:10] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:04:22] !log aaron@tin Synchronized php-1.28.0-wmf.5/includes/deferred/LinksDeletionUpdate.php: fd44d649787ede78687b4cd2ef21e44a4c8b843b (duration: 00m 33s) [10:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:05:12] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:07:02] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 1.115 second response time [10:07:21] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [10:11:43] 06Operations, 10netops, 10Continuous-Integration-Infrastructure (phase-out-gallium): install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1204862 (10Peachey88) Quote from IRC discussing {T137265} replacement box/emergency spares boxes > seems T95959 has... 
[10:13:57] !log rolling out the new varnishkafka package to cache maps [10:13:57] (03PS1) 10Giuseppe Lavagetto: Add darmstadtium.eqiad.wmnet (Eqiad Row a private) [dns] - 10https://gerrit.wikimedia.org/r/293278 (https://phabricator.wikimedia.org/T137265) [10:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:14:23] (03PS1) 10Giuseppe Lavagetto: contint: add darmstadtium as gallium replacement [puppet] - 10https://gerrit.wikimedia.org/r/293279 (https://phabricator.wikimedia.org/T137265) [10:14:29] <_joe_> hashar: ^^ [10:14:50] <_joe_> I am not sure the partitioning recipe makes sense tbh [10:14:57] darmstadtium instead of contint1001 ? :D [10:14:58] <_joe_> I just reproduced how gallium was done [10:15:13] <_joe_> hashar: yeah let's stick with the naming convention [10:15:21] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:15:37] I don't really care about the hostname, just find contint1001 easier to type/remember and to figure out what the host is doing [10:15:47] for partitioning, there is no need for SSD /srv/ssd on the new host [10:15:56] that is solely for the zuul-merger process and we have another one on scandium [10:15:56] <_joe_> hashar: yeah, I was asking you [10:16:09] <_joe_> are there directories where you expect to use more space [10:16:19] <_joe_> say /var/lib/jenkins and /srv ?
[10:16:24] (03PS1) 10Ottomata: Initial debianization and release [debs/python-confluent-kafka] (debian) - 10https://gerrit.wikimedia.org/r/293280 [10:16:29] I think historically we had a RAID 1 mirror of two 500 GB disks for redundancy [10:16:49] (03CR) 10JanZerebecki: Add darmstadtium.eqiad.wmnet (Eqiad Row a private) (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/293278 (https://phabricator.wikimedia.org/T137265) (owner: 10Giuseppe Lavagetto) [10:17:00] /var/lib/jenkins is the bulk of disk space consumption since it holds the build logs + artifacts (such as console log, various log files) [10:17:05] /srv/ssd isn't part of the initial partitioning anyway, it's mounted separately in puppet [10:17:17] /srv/org and similar are the doc.wikimedia.org and integration.wikimedia.org websites [10:17:25] the former hosts a bunch of generated documentation [10:17:35] no idea about their respective usage though [10:18:09] /var/lib/jenkins is probably in the order of 200 Gbytes, /srv/org ~50 Gbytes (wild estimates) [10:18:45] <_joe_> I am thinking lvm would serve us better [10:19:40] <_joe_> I'll amend the patch [10:21:11] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [10:21:51] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Port Zuul package 2.1.0-95-g66c8e52 from Precise to Jessie - https://phabricator.wikimedia.org/T137279#2363622 (10hashar) @MoritzMuehlenhoff is taking care of adding the Jenkins 1.652.2 Debian packages fo... [10:22:47] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban: Smartctl disk defects on kafka1012 - https://phabricator.wikimedia.org/T136933#2363623 (10Ottomata) Let’s do today. Apparently analytics1049 has a bad disk too. Maybe we can do them together! Ping either elukey or me when you are online...
[10:36:04] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2363649 (10hashar) @MoritzMuehlenhoff is taking care of adding the Jenkins 1.652.2 Debian packages for jessie-wikime... [10:37:31] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Port Zuul package 2.1.0-95-g66c8e52 from Precise to Jessie - https://phabricator.wikimedia.org/T137279#2363665 (10hashar) p:05High>03Normal The package seems to work fine. I had a dummy zuul layout a... [10:39:21] (03PS2) 10Ottomata: Add ssh:userkey for eventlogging scap::targets [puppet] - 10https://gerrit.wikimedia.org/r/293217 (https://phabricator.wikimedia.org/T137192) (owner: 1020after4) [10:39:46] (03PS3) 10Ottomata: Add ssh:userkey for eventlogging scap::targets [puppet] - 10https://gerrit.wikimedia.org/r/293217 (https://phabricator.wikimedia.org/T137192) (owner: 1020after4) [10:40:47] (03PS2) 10Giuseppe Lavagetto: contint: add contint1001 as gallium replacement [puppet] - 10https://gerrit.wikimedia.org/r/293279 (https://phabricator.wikimedia.org/T137265) [10:40:57] !log uploaded jenkins 1.651.2 for jessie-wikimedia to carbon [10:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:41:12] (03CR) 10Ottomata: "Really not sure how this was ever working in prod or beta. Maybe the user key used to be installed even if manage_user was false? 
Dunno" [puppet] - 10https://gerrit.wikimedia.org/r/293217 (https://phabricator.wikimedia.org/T137192) (owner: 1020after4) [10:41:29] (03CR) 10Ottomata: [C: 031] Add ssh:userkey for eventlogging scap::targets [puppet] - 10https://gerrit.wikimedia.org/r/293217 (https://phabricator.wikimedia.org/T137192) (owner: 1020after4) [10:42:33] (03PS2) 10Giuseppe Lavagetto: Add contint1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/293278 (https://phabricator.wikimedia.org/T137265) [10:43:16] (03PS3) 10Giuseppe Lavagetto: contint: add contint1001 as gallium replacement [puppet] - 10https://gerrit.wikimedia.org/r/293279 (https://phabricator.wikimedia.org/T137265) [10:43:46] (03CR) 10Giuseppe Lavagetto: Add contint1001.eqiad.wmnet (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/293278 (https://phabricator.wikimedia.org/T137265) (owner: 10Giuseppe Lavagetto) [10:45:24] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Add contint1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/293278 (https://phabricator.wikimedia.org/T137265) (owner: 10Giuseppe Lavagetto) [10:47:25] 06Operations, 13Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2363714 (10jcrespo) [10:47:29] 06Operations, 10DBA: Email spam from some MariaDB's logrotate - https://phabricator.wikimedia.org/T127638#2363711 (10jcrespo) 05Open>03Resolved a:03jcrespo I have deleted all cron jobs on db* hosts trying to rotate mysql logs (we do not allow debian to access mysql in production). We can create later a... 
[10:47:56] (03PS1) 10Giuseppe Lavagetto: Fix sorting of hostnames [dns] - 10https://gerrit.wikimedia.org/r/293282 [10:49:12] 06Operations, 10DBA: Email spam from some MariaDB's logrotate - https://phabricator.wikimedia.org/T127638#2363720 (10jcrespo) The above ticket exists and it is: T127636 [10:51:49] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2363735 (10Paladox) [10:52:10] (03CR) 10JanZerebecki: Add contint1001.eqiad.wmnet (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/293278 (https://phabricator.wikimedia.org/T137265) (owner: 10Giuseppe Lavagetto) [10:54:49] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] contint: add contint1001 as gallium replacement [puppet] - 10https://gerrit.wikimedia.org/r/293279 (https://phabricator.wikimedia.org/T137265) (owner: 10Giuseppe Lavagetto) [10:55:39] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:57:39] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 8.853 second response time [10:57:54] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2363747 (10Paladox) [11:00:03] 06Operations, 10Ops-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jan Dittrich - https://phabricator.wikimedia.org/T136560#2363748 (10jcrespo) Thank you, I can find you now. You are now part of the LDAP group called grafana-admin, and you should be able to login to https://grafana-admin.wikimedia.org with...
[11:04:26] (03PS3) 10Hashar: contint: cleanup gallium / use contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/293283 (https://phabricator.wikimedia.org/T137265) [11:04:28] (03PS3) 10Hashar: cache_misc: change doc/integration.wm.o backend [puppet] - 10https://gerrit.wikimedia.org/r/293284 (https://phabricator.wikimedia.org/T137265) [11:04:28] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:06:08] (03CR) 10Hashar: "gallium died and is being replaced by contint1001.eqiad.wmnet . We probably want to wait for the service to be restored before changing th" [puppet] - 10https://gerrit.wikimedia.org/r/293284 (https://phabricator.wikimedia.org/T137265) (owner: 10Hashar) [11:08:17] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 1.935 second response time [11:10:54] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2362975 (10Paladox) [11:12:52] 06Operations, 10Phabricator: Redirect phabricator.mediawiki.org to phabricator.wikimedia.org - https://phabricator.wikimedia.org/T137252#2363764 (10Aklapper) p:05Triage>03Lowest [11:25:43] (03PS1) 10Hashar: zuul.eqiad.wmnet is no more of any use [dns] - 10https://gerrit.wikimedia.org/r/293288 (https://phabricator.wikimedia.org/T137265) [11:26:13] (03PS4) 10Hashar: contint: cleanup gallium / use contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/293283 (https://phabricator.wikimedia.org/T137265) [11:26:48] (03CR) 10Hashar: "In Nodepool configuration I have replaced zuul.eqiad.wmnet in favor of the server hostname. 
The related DNS entry is dropped via https://g" [puppet] - 10https://gerrit.wikimedia.org/r/293283 (https://phabricator.wikimedia.org/T137265) (owner: 10Hashar) [11:40:13] (03PS1) 10Mobrovac: Set maxClientCnxns to 0 (unlimited) [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/293290 [11:41:42] (03PS1) 10Dereckson: Fix whitespace issue [puppet] - 10https://gerrit.wikimedia.org/r/293291 [11:43:39] (03CR) 10Dereckson: contint: add contint1001 as gallium replacement (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293279 (https://phabricator.wikimedia.org/T137265) (owner: 10Giuseppe Lavagetto) [11:48:28] (03CR) 10Ottomata: "One nit, other than that +1" (031 comment) [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/293290 (owner: 10Mobrovac) [11:49:42] (03PS1) 10Mobrovac: Change Prop: Fix indentation for transclusions [puppet] - 10https://gerrit.wikimedia.org/r/293292 [11:51:13] 06Operations, 10DNS, 10Phabricator, 10Traffic: Redirect phabricator.mediawiki.org to phabricator.wikimedia.org - https://phabricator.wikimedia.org/T137252#2363915 (10Danny_B) [11:54:57] (03PS2) 10Faidon Liambotis: Reinstate Adam Wight's SSH key "snack" [puppet] - 10https://gerrit.wikimedia.org/r/293259 (https://phabricator.wikimedia.org/T137162) (owner: 10Awight) [11:55:04] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Reinstate Adam Wight's SSH key "snack" [puppet] - 10https://gerrit.wikimedia.org/r/293259 (https://phabricator.wikimedia.org/T137162) (owner: 10Awight) [11:55:49] (03CR) 10Ppchelko: [C: 031] "Ouch.." [puppet] - 10https://gerrit.wikimedia.org/r/293292 (owner: 10Mobrovac) [11:56:15] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: New SSH key for AWight - https://phabricator.wikimedia.org/T137162#2363942 (10faidon) 05Open>03Resolved Oops — I misread the task description (thought you meant "except for…" the new key). Sorry about that! Thanks for putting up a patch even, this... 
[11:56:40] 06Operations, 10ops-eqiad, 10Analytics-Cluster: analytics1049.eqiad.wmnet disk failure - https://phabricator.wikimedia.org/T137273#2363964 (10Danny_B) [11:58:29] (03CR) 10BBlack: [C: 031] cache_misc: change doc/integration.wm.o backend [puppet] - 10https://gerrit.wikimedia.org/r/293284 (https://phabricator.wikimedia.org/T137265) (owner: 10Hashar) [11:59:26] (03PS2) 10Faidon Liambotis: Change Prop: Fix indentation for transclusions [puppet] - 10https://gerrit.wikimedia.org/r/293292 (owner: 10Mobrovac) [11:59:33] (03CR) 10Faidon Liambotis: [C: 032] Change Prop: Fix indentation for transclusions [puppet] - 10https://gerrit.wikimedia.org/r/293292 (owner: 10Mobrovac) [11:59:40] (03CR) 10Faidon Liambotis: [V: 032] Change Prop: Fix indentation for transclusions [puppet] - 10https://gerrit.wikimedia.org/r/293292 (owner: 10Mobrovac) [12:00:16] (03CR) 10Elukey: Set maxClientCnxns to 0 (unlimited) (031 comment) [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/293290 (owner: 10Mobrovac) [12:04:10] (03CR) 10Ottomata: Set maxClientCnxns to 0 (unlimited) (032 comments) [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/293290 (owner: 10Mobrovac) [12:07:33] (03CR) 10Elukey: Set maxClientCnxns to 0 (unlimited) (031 comment) [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/293290 (owner: 10Mobrovac) [12:09:08] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2362975 (10Paladox) [12:09:27] (03PS2) 10Mobrovac: Set maxClientCnxns to 0 (unlimited) [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/293290 [12:09:56] !log mounted temporarily / partition from gallium sda on db1085:/mnt [12:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:15:05] (03CR) 10Elukey: "Current values:" [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/293290 (owner: 10Mobrovac) [12:17:07] (03PS1) 
10Muehlenhoff: Define backup for contint [puppet] - 10https://gerrit.wikimedia.org/r/293294 (https://phabricator.wikimedia.org/T80385) [12:17:09] (03PS1) 10Muehlenhoff: Enable backup for contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/293295 (https://phabricator.wikimedia.org/T80385) [12:17:34] (03PS1) 10Ema: Add 'varnish_version' salt grain [puppet] - 10https://gerrit.wikimedia.org/r/293296 (https://phabricator.wikimedia.org/T131499) [12:24:12] (03CR) 10Paladox: [C: 031] cache_misc: change doc/integration.wm.o backend [puppet] - 10https://gerrit.wikimedia.org/r/293284 (https://phabricator.wikimedia.org/T137265) (owner: 10Hashar) [12:27:42] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 1.350 second response time [12:28:00] thnx paravoid for the merge! [12:28:27] (03CR) 10BBlack: [C: 031] Add 'varnish_version' salt grain [puppet] - 10https://gerrit.wikimedia.org/r/293296 (https://phabricator.wikimedia.org/T131499) (owner: 10Ema) [12:29:42] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.021 second response time [12:30:50] (03CR) 10Ottomata: [C: 032 V: 032] Set maxClientCnxns to 0 (unlimited) [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/293290 (owner: 10Mobrovac) [12:33:41] (03PS1) 10Mobrovac: Zookeeper: set maxClientCnxns to 0 (aka unlimited) [puppet] - 10https://gerrit.wikimedia.org/r/293298 [12:34:37] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/293296 (https://phabricator.wikimedia.org/T131499) (owner: 10Ema) [12:35:32] (03PS2) 10Ema: Add 'varnish_version' salt grain [puppet] - 10https://gerrit.wikimedia.org/r/293296 (https://phabricator.wikimedia.org/T131499) [12:35:51] (03CR) 10Ema: [C: 032 V: 032] Add 'varnish_version' salt grain [puppet] - 10https://gerrit.wikimedia.org/r/293296 (https://phabricator.wikimedia.org/T131499) (owner: 10Ema) [12:40:58] PROBLEM - 
graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:42:47] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.026 second response time [12:42:58] 06Operations, 06Discovery, 06Maps: Tune thread for osm2pgsql / postgres max connections for Maps - https://phabricator.wikimedia.org/T137229#2364124 (10Gehel) I would expect the number of threads and the number of workers to have no direct relation to each other. Especially in node where IO should be async...... [12:43:29] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup payments1005-8 - https://phabricator.wikimedia.org/T136881#2350904 (10Jgreen) [12:43:32] (03PS1) 10Hashar: contint: remove packages from prod slave [puppet] - 10https://gerrit.wikimedia.org/r/293299 [12:48:27] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup payments1005-8 - https://phabricator.wikimedia.org/T136881#2364201 (10Jgreen) [12:50:20] 06Operations, 10Analytics, 10Analytics-Cluster, 06Services: Better monitoring for Zookeeper - https://phabricator.wikimedia.org/T137302#2364233 (10Ottomata) [12:50:30] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup payments1005-8 - https://phabricator.wikimedia.org/T136881#2364247 (10Jgreen) [12:51:46] 06Operations, 10Analytics, 10Analytics-Cluster, 10EventBus, 06Services: Better monitoring for Zookeeper - https://phabricator.wikimedia.org/T137302#2364255 (10mobrovac) [12:52:08] RECOVERY - MD RAID on gallium is OK: OK: Active: 1, Working: 1, Failed: 0, Spare: 0 [12:55:26] (03PS2) 10Hashar: contint: remove packages from prod slave [puppet] - 10https://gerrit.wikimedia.org/r/293299 [12:58:40] (03CR) 10Hashar: [C: 031] "PS2 fix a duplicate definition of 'doxygen' by using require_package."
[puppet] - 10https://gerrit.wikimedia.org/r/293299 (owner: 10Hashar) [12:58:41] 06Operations, 10Analytics, 10Analytics-Cluster, 10EventBus, 06Services: Better monitoring for Zookeeper - https://phabricator.wikimedia.org/T137302#2364285 (10Ottomata) The default client connection limit in our puppet module is unlimited. Once we have alerts we should set a limit, pretty high, maybe 2048. [12:59:01] (03PS2) 10Ottomata: Zookeeper: set maxClientCnxns to 0 (aka unlimited) [puppet] - 10https://gerrit.wikimedia.org/r/293298 (owner: 10Mobrovac) [12:59:15] (03CR) 10Ottomata: [C: 032 V: 032] "Once T137302 is done we will set a real limit" [puppet] - 10https://gerrit.wikimedia.org/r/293298 (owner: 10Mobrovac) [12:59:42] (03PS3) 10Giuseppe Lavagetto: contint: remove packages from prod slave [puppet] - 10https://gerrit.wikimedia.org/r/293299 (owner: 10Hashar) [13:00:07] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] contint: remove packages from prod slave [puppet] - 10https://gerrit.wikimedia.org/r/293299 (owner: 10Hashar) [13:03:45] (03CR) 10Eevans: "> LGTM. I have a question about the cassandra common hiera settings" [puppet] - 10https://gerrit.wikimedia.org/r/290860 (owner: 10Eevans) [13:03:52] 06Operations, 10Phabricator, 06Project-Admins, 06Triagers: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#2364293 (10Aklapper) Now that [[ https://www.mediawiki.org/wiki/Phabricator/Project_management#Parent_Projects.2C_Sub-projects_and_Milesto... [13:06:09] (03PS1) 10Hashar: contint: hiera conf for contint1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/293301 (https://phabricator.wikimedia.org/T137265) [13:06:51] (03CR) 10Eevans: "Gah, I just noticed that when I updated cassandra-metrics-collector to address the issues we were having here:https://wikitech.wikimedia.o" [puppet] - 10https://gerrit.wikimedia.org/r/290860 (owner: 10Eevans) [13:06:58] (03CR) 10Elukey: "Thanks for the explanation! No objections for AQS."
[puppet] - 10https://gerrit.wikimedia.org/r/290860 (owner: 10Eevans) [13:07:05] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] contint: hiera conf for contint1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/293301 (https://phabricator.wikimedia.org/T137265) (owner: 10Hashar) [13:07:07] (03PS2) 10Eevans: filter out new metrics [puppet] - 10https://gerrit.wikimedia.org/r/290860 [13:07:41] (03PS1) 10Giuseppe Lavagetto: contint1001: add role::ci::master [puppet] - 10https://gerrit.wikimedia.org/r/293302 (https://phabricator.wikimedia.org/T137265) [13:07:43] (03PS1) 10Giuseppe Lavagetto: dhcp: fix typo in contint1001 name [puppet] - 10https://gerrit.wikimedia.org/r/293303 [13:08:10] (03PS2) 10Giuseppe Lavagetto: contint1001: add role::ci::master [puppet] - 10https://gerrit.wikimedia.org/r/293302 (https://phabricator.wikimedia.org/T137265) [13:08:17] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] contint1001: add role::ci::master [puppet] - 10https://gerrit.wikimedia.org/r/293302 (https://phabricator.wikimedia.org/T137265) (owner: 10Giuseppe Lavagetto) [13:08:31] !log rolling out new varnishkafka package in cache misc [13:08:31] (03PS2) 10Giuseppe Lavagetto: dhcp: fix typo in contint1001 name [puppet] - 10https://gerrit.wikimedia.org/r/293303 [13:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:08:36] PROBLEM - Hadoop ResourceManager on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.resourcemanager.ResourceManager [13:08:42] whaaat? [13:08:45] ottomata: --^ [13:08:52] probably zk? [13:09:00] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:09:10] uh oh, on it [13:09:11] yeah [13:09:41] hehe, elukey 1002 took over as active! :) [13:09:42] elukey: did our vk logging format configs need updating in sync with the new package update? for the time fields? 
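A side note on the maxClientCnxns change merged above: until T137302 adds real monitoring, ZooKeeper's connection load can be sampled by hand with its four-letter `mntr` admin command, which returns tab-separated key/value stats including `zk_num_alive_connections`. A minimal sketch, assuming a reachable ZooKeeper server (the host and port below are illustrative, not taken from the log):

```python
# Sample ZooKeeper stats via the four-letter 'mntr' command.
# Host/port are illustrative assumptions; 'mntr' itself is a
# standard ZooKeeper admin command returning key<TAB>value lines.
import socket

def zk_mntr(host='localhost', port=2181, timeout=5.0):
    """Send 'mntr' to a ZooKeeper server and return its stats as a dict."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(b'mntr')
        s.shutdown(socket.SHUT_WR)  # signal end of request
        data = b''
        while True:
            chunk = s.recv(4096)
            if not chunk:
                break
            data += chunk
    return parse_mntr(data.decode())

def parse_mntr(text):
    """Parse 'key<TAB>value' lines into a dict, keeping values as strings."""
    stats = {}
    for line in text.splitlines():
        if '\t' in line:
            key, value = line.split('\t', 1)
            stats[key] = value
    return stats
```

With alerts wired to `zk_num_alive_connections`, a finite maxClientCnxns (the "pretty high, maybe 2048" from the task) becomes safe to set.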
[13:09:59] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=cxserver.svc.codfw.wmnet, port=8080): Max retries exceeded with url: /?spec (Caused by class socket.error: [Errno 111] Connection refused) [13:10:00] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:10:22] (03PS3) 10Giuseppe Lavagetto: dhcp: fix typo in contint1001 name [puppet] - 10https://gerrit.wikimedia.org/r/293303 [13:10:30] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:10:45] RECOVERY - Hadoop ResourceManager on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.resourcemanager.ResourceManager [13:10:49] PROBLEM - cxserver endpoints health on scb2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.48.43, port=8080): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:10:57] bblack: nope it is backward compatible, I will rollout the new config as second step.. tried it yesterday live hacking cp1046 [13:11:19] PROBLEM - MD RAID on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:11:20] PROBLEM - changeprop endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:11:20] PROBLEM - mathoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:11:30] PROBLEM - MegaRAID on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:11:32] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] dhcp: fix typo in contint1001 name [puppet] - 10https://gerrit.wikimedia.org/r/293303 (owner: 10Giuseppe Lavagetto) [13:11:39] PROBLEM - mathoid endpoints health on scb2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.48.43, port=10042): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:11:40] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:11:41] PROBLEM - cxserver endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:11:50] PROBLEM - salt-minion processes on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:11:50] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=mobileapps.svc.eqiad.wmnet, port=8888): Max retries exceeded with url: /?spec (Caused by class socket.error: [Errno 111] Connection refused) [13:11:59] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=citoid.svc.codfw.wmnet, port=1970): Max retries exceeded with url: /?spec (Caused by class socket.error: [Errno 111] Connection refused) [13:12:00] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:12:01] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:12:02] this looks like ZK restarts [13:12:07] well [13:12:08] uh [13:12:09] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.48.43, port=1970): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:12:12] dunno what is with those ^ [13:12:20] PROBLEM - dhclient process on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:12:23] mobrovac: ^ [13:12:30] PROBLEM - puppet last run on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:12:30] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: Generic error: Timeout on connection while downloading http://mobileapps.svc.codfw.wmnet:8888/?spec [13:12:37] did zookeeper restarts cause change prop to get real angry and break stuff? [13:12:40] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:12:49] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=1970): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:12:49] PROBLEM - changeprop endpoints health on scb2001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.132, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:12:49] PROBLEM - mathoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:12:49] PROBLEM - Mathoid LVS eqiad on mathoid.svc.eqiad.wmnet is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=mathoid.svc.eqiad.wmnet, port=10042): Max retries exceeded with url: /?spec (Caused by class socket.error: [Errno 111] Connection refused) [13:12:50] PROBLEM - SSH on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:12:50] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=citoid.svc.eqiad.wmnet, port=1970): Max retries exceeded with url: /?spec (Caused by class socket.error: [Errno 111] Connection refused) [13:12:54] waat? [13:12:57] _joe_: ^ [13:13:00] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=19000): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:13:09] all of restbase down? [13:13:10] RECOVERY - MD RAID on scb1001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [13:13:10] PROBLEM - Disk space on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:13:11] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:13:19] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:13:19] PROBLEM - MD RAID on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
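The "endpoints health" and LVS alerts above come from probes of each service's /?spec URL. A rough sketch of such a probe, not the actual service-checker code (the URL handling and messages are illustrative):

```python
# Rough sketch of a /?spec health probe like the ones generating the
# alerts above. Not the real check implementation; messages and the
# 10-second timeout are illustrative assumptions.
import json
import urllib.error
import urllib.request

def probe_spec(base_url, timeout=10):
    """Fetch <base_url>/?spec and return (ok, message)."""
    try:
        with urllib.request.urlopen(base_url + '/?spec', timeout=timeout) as resp:
            return format_result(resp.status, resp.read())
    except (urllib.error.URLError, OSError) as exc:
        return (False, 'Generic connection error: %s' % exc)

def format_result(status, body):
    """Classify a response: HTTP 200 with a parseable JSON spec is OK."""
    if status != 200:
        return (False, 'unexpected status %d (expecting: 200)' % status)
    try:
        json.loads(body)
    except ValueError:
        return (False, 'spec is not valid JSON')
    return (True, 'spec fetched OK')
```

This is why a dead service shows up as "Connection refused" while an overloaded one shows a timeout: both are caught as connection errors before any status check runs.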
[13:13:20] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - ores_8081 - Could not depool server scb2001.codfw.wmnet because of too many down!: graphoid_19000 - Could not depool server scb2001.codfw.wmnet because of too many down!: cxserver_8080 - Could not depool server scb2001.codfw.wmnet because of too many down!: citoid_1970 - Could not depool server scb2001.codfw.wmnet because of too many down!: mathoid_10042 - [13:13:29] PROBLEM - graphoid endpoints health on scb2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.48.43, port=19000): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:13:29] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - cxserver_8080 - Could not depool server scb1002.eqiad.wmnet because of too many down!: citoid_1970 - Could not depool server scb1002.eqiad.wmnet because of too many down!: mathoid_10042 - Could not depool server scb1001.eqiad.wmnet because of too many down!: graphoid_19000 - Could not depool server scb1002.eqiad.wmnet because of too many down!: mobileapps_8 [13:13:29] cpu is at 95% on scb1001 [13:13:31] wth? 
[13:13:36] urandom --^ [13:13:38] <_joe_> it's all codfw [13:13:42] <_joe_> not eqiad [13:13:45] PROBLEM - LVS HTTP IPv4 on cxserver.svc.codfw.wmnet is CRITICAL: Connection refused [13:13:45] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - ores_8081 - Could not depool server scb2001.codfw.wmnet because of too many down!: graphoid_19000 - Could not depool server scb2001.codfw.wmnet because of too many down!: cxserver_8080 - Could not depool server scb2001.codfw.wmnet because of too many down!: citoid_1970 - Could not depool server scb2001.codfw.wmnet because of too many down!: mathoid_10042 - [13:13:51] <_joe_> also, someone else needs to take a look [13:13:55] _joe_, "Could not depool server scb1002.eqiad.wmnet because of too many down" [13:13:59] PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - cxserver_8080 - Could not depool server scb1002.eqiad.wmnet because of too many down!: ores_8081 - Could not depool server scb1001.eqiad.wmnet because of too many down!: citoid_1970 - Could not depool server scb1002.eqiad.wmnet because of too many down!: mathoid_10042 - Could not depool server scb1001.eqiad.wmnet because of too many down!: graphoid_19000 - [13:14:12] <_joe_> akosiaris, paravoid, ema, elukey, bblack [13:14:12] jynus: that just means what it says, the other alternatives were already down too [13:14:13] <_joe_> anyone [13:14:15] PROBLEM - LVS HTTP IPv4 on mobileapps.svc.eqiad.wmnet is CRITICAL: Connection refused [13:14:16] PROBLEM - ores on scb2002 is CRITICAL: Connection refused [13:14:22] PROBLEM - LVS HTTP IPv4 on cxserver.svc.eqiad.wmnet is CRITICAL: Connection refused [13:14:23] we are looking [13:14:23] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
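As noted above, "Could not depool ... because of too many down!" means PyBal's depool threshold kicked in: a failing server is only actually depooled if enough of the pool would remain up afterwards. An illustrative sketch of that decision, not PyBal's actual code (the 0.5 default threshold is an assumption):

```python
# Illustrative sketch of depool-threshold logic: a monitor may mark a
# server down, but it is only depooled if at least ceil(threshold * total)
# servers would stay pooled. Not PyBal's real implementation; the 0.5
# default is an assumption for the example.
import math

def can_depool(pooled, total, threshold=0.5):
    """Return True if removing one more server keeps at least
    ceil(threshold * total) servers pooled."""
    return pooled - 1 >= math.ceil(threshold * total)
```

So once half the scb pool was already down, further failures stayed pooled (and kept receiving traffic), which is exactly the situation the PyBal alerts describe.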
[13:14:28] <_joe_> mobileapps is now down in eqiad as well [13:14:31] <_joe_> shit [13:14:38] PROBLEM - LVS HTTP IPv4 on citoid.svc.codfw.wmnet is CRITICAL: Connection refused [13:14:44] <_joe_> restbase going down seems like the most probable cause? [13:14:44] _joe_: i just restarted zookeeper! dunno what scbs are doing..., mobrovac is looking [13:14:52] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: puppet fail [13:14:53] wth [13:14:53] zookeeper is this central to everything in realtime? [13:14:54] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - cxserver_8080 - Could not depool server scb1002.eqiad.wmnet because of too many down!: citoid_1970 - Could not depool server scb1002.eqiad.wmnet because of too many down!: mathoid_10042 - Could not depool server scb1001.eqiad.wmnet because of too many down!: graphoid_19000 - Could not depool server scb1002.eqiad.wmnet because of too many down!: mobileapps_8 [13:15:01] can't log in anywhere [13:15:02] PROBLEM - DPKG on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:15:02] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:15:02] PROBLEM - configured eth on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:15:02] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:15:02] PROBLEM - puppet last run on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:15:02] cxserver in eqiad too [13:15:09] <_joe_> mobrovac: log in where? [13:15:10] scb that is [13:15:12] RECOVERY - changeprop endpoints health on scb2001 is OK: All endpoints are healthy [13:15:18] PROBLEM - LVS HTTP IPv4 on mathoid.svc.codfw.wmnet is CRITICAL: Connection refused [13:15:18] ha! 
[13:15:18] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=cxserver.svc.eqiad.wmnet, port=8080): Max retries exceeded with url: /?spec (Caused by class socket.error: [Errno 111] Connection refused) [13:15:19] PROBLEM - restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:15:33] PROBLEM - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=mathoid.svc.codfw.wmnet, port=10042): Max retries exceeded with url: /?spec (Caused by class socket.error: [Errno 111] Connection refused) [13:15:34] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:15:37] <_joe_> I can't login either [13:15:41] likewise [13:15:44] <_joe_> something killed all of those machines [13:15:49] PROBLEM - LVS HTTP IPv4 on citoid.svc.eqiad.wmnet is CRITICAL: Connection refused [13:15:52] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:15:52] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:15:58] PROBLEM - LVS HTTP IPv4 on mathoid.svc.eqiad.wmnet is CRITICAL: Connection refused [13:16:03] ottomata: what's the status of zk on conf* ? [13:16:13] PROBLEM - configured eth on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:16:19] mobrovac: looking fine [13:16:21] scb2001 is probably too cpu-overloaded to allow a login, or OOM [13:16:21] <_joe_> load average: 1221.56, 968.97, 490.39 [13:16:22] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:16:23] PROBLEM - ores uWSGI web app on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:16:23] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:16:25] <_joe_> on scb1002 [13:16:29] PROBLEM - LVS HTTP IPv4 on graphoid.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:16:30] PROBLEM - configured eth on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:16:30] PROBLEM - MegaRAID on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:16:30] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:16:32] PROBLEM - Check size of conntrack table on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:16:32] PROBLEM - mathoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:16:46] logs were busy for a while with accepting connections, that then have expired sessions [13:16:51] and still are a bit [13:16:52] PROBLEM - graphoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:17:02] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 504 (expecting: 200) [13:17:03] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:17:03] PROBLEM - ores uWSGI web app on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:17:10] <_joe_> I think it's ores [13:17:12] PROBLEM - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is CRITICAL: Generic error: Timeout on connection while downloading http://graphoid.svc.eqiad.wmnet:19000/?spec [13:17:13] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:17:13] PROBLEM - ores uWSGI web app on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:17:13] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:17:13] PROBLEM - DPKG on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:17:14] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 504 (expecting: 200) [13:17:15] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 504 (expecting: 200) [13:17:15] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 504 (expecting: 200) [13:17:22] ores recently deployed, good candidate [13:17:23] is scb causing RB problems or the other way around? [13:17:24] <_joe_> can someone powercycle scb1001? [13:17:28] what's going on? [13:17:33] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 504 (expecting: 200) [13:17:33] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:17:34] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 504 (expecting: 200) [13:17:34] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 504 (expecting: 200) [13:17:34] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 504 (expecting: 200) [13:17:34] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 504 (expecting: 200) [13:17:35] power cycle would probably do it [13:17:38] on it. 
[13:17:42] <_joe_> memory exhausted on both scb100* [13:17:42] paravoid: services stuff all around is failing: RB, scb, cxserver, mobileapps, etc [13:17:52] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 504 (expecting: 200) [13:17:52] PROBLEM - ores on scb2001 is CRITICAL: Connection refused [13:18:01] <_joe_> rb is not [13:18:02] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 504 (expecting: 200) [13:18:03] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 504 (expecting: 200) [13:18:07] paravoid: scb is most likely at the root, and most likely ores deploy there screwed those hosts [13:18:08] <_joe_> was a red herring [13:18:08] bblack: the other way round [13:18:14] PROBLEM - ores uWSGI web app on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:18:14] PROBLEM - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=graphoid.svc.codfw.wmnet, port=19000): Max retries exceeded with url: /?spec (Caused by class socket.error: [Errno 111] Connection refused) [13:18:16] <_joe_> rb is just seeing other services fail [13:18:32] <_joe_> so let's first restart scb1001 [13:18:37] <_joe_> and stop ores there [13:18:39] PROBLEM - LVS HTTP IPv4 on graphoid.svc.codfw.wmnet is CRITICAL: Connection refused [13:18:39] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 504 (expecting: 200) [13:18:39] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 504 (expecting: 200) [13:18:42] PROBLEM - Disk space on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:18:46] !log powercycling scb1001 [13:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:18:50] there was a traffic spike that preceded the CPU/memory spikes [13:19:07] FWIW, overall public-facing cache_text 5xx isn't that bad, for all the noise above [13:19:12] PROBLEM - Check size of conntrack table on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:19:14] there's a small spike/elevation, but small in the overall [13:19:16] <_joe_> paravoid: so ores spawned a lot of celery workers, from the partial ps -ef I could see [13:19:24] hm [13:19:32] PROBLEM - Check size of conntrack table on scb1001 is CRITICAL: Timeout while attempting connection [13:19:36] this happened almost right after restarted zookeepers [13:19:40] was there something going on ores-related in the last 10, 15 mins? 
[13:19:43] PROBLEM - MD RAID on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:19:52] <_joe_> mobrovac: a ton of edits would [13:19:54] should I do 2002 as well? [13:19:57] apparently an ores deploy, and a zk restart. independent or related? [13:20:07] ottomata: yeah, it's definitely a correlation [13:20:09] <_joe_> probably the ores deploy [13:20:13] PROBLEM - salt-minion processes on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:20:13] PROBLEM - dhclient process on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:20:22] PROBLEM - DPKG on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:20:23] <_joe_> ottomata: did you log in and stop ores? [13:20:33] PROBLEM - SSH on scb2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:20:46] <_joe_> I'm doing it [13:20:52] RECOVERY - Disk space on scb2001 is OK: DISK OK [13:20:52] PROBLEM - Disk space on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:20:57] <_joe_> but I don't see it coming back up [13:21:02] _joe_ i only powercycled scb1001 from mgmt console [13:21:04] i couldn't log in [13:21:09] haven't done anything else [13:21:13] PROBLEM - puppet last run on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:15] can't log in either [13:21:31] i can't log into 2002, should I power cycle it? [13:21:32] PROBLEM - SSH on scb2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:32] PROBLEM - puppet last run on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:33] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: Generic error: Timeout on connection while downloading http://mobileapps.svc.codfw.wmnet:8888/?spec [13:21:33] PROBLEM - configured eth on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:33] PROBLEM - dhclient process on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:21:34] PROBLEM - salt-minion processes on mw1063 is CRITICAL: Connection refused by host [13:21:38] don't powercycle it without a plan :P [13:21:40] <_joe_> mobrovac: it's coming back up [13:21:44] PROBLEM - salt-minion processes on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:47] <_joe_> bblack: I have a plan [13:21:50] ok [13:21:52] PROBLEM - changeprop endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:53] seeing [13:21:53] scb2002 login: [685389.516092] Out of memory: Kill process 121524 (nodejs) score 2 or sacrifice child [13:21:53] [685389.524680] Killed process 121524 (nodejs) total-vm:1307404kB, anon-rss:66024kB, file-rss:544kB [13:21:54] not much showing on the job queue at least [13:21:55] on console [13:22:03] PROBLEM - configured eth on mw1063 is CRITICAL: Connection refused by host [13:22:03] PROBLEM - Check size of conntrack table on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:22:03] I was going to suggest single-user and disabling the service manually before fully coming up [13:22:12] RECOVERY - mathoid endpoints health on scb1001 is OK: All endpoints are healthy [13:22:19] RECOVERY - LVS HTTP IPv4 on mathoid.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 925 bytes in 0.012 second response time [13:22:20] PROBLEM - dhclient process on mw1063 is CRITICAL: Connection refused by host [13:22:20] PROBLEM - ores on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:34] RECOVERY - configured eth on scb1001 is OK: OK - interfaces up [13:22:34] RECOVERY - MegaRAID on scb1001 is OK: OK: no disks configured for RAID [13:22:35] we're at 20% cpu now [13:22:35] hey what's up ? [13:22:42] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [13:22:59] akosiaris, scb down, affecting multiple services [13:23:00] what's the user-facing impact as far as we know so far? 
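[Editor's note] The console lines above are standard kernel OOM-killer output; the interesting number is `total-vm` (the victim's total virtual memory). A rough sketch of pulling it out of such a line, using the exact message seen on the scb2002 console (the awk pipeline is illustrative, not something run during the incident):

```shell
# Parse a kernel "Killed process" line and report the victim's total-vm in MB.
kill_line='Killed process 121524 (nodejs) total-vm:1307404kB, anon-rss:66024kB, file-rss:544kB'
echo "$kill_line" | awk '{
  for (i = 1; i <= NF; i++)
    if ($i ~ /^total-vm:/) {
      gsub(/[^0-9]/, "", $i)     # strip "total-vm:" prefix and "kB," suffix
      printf "total-vm: %d MB\n", $i / 1024
    }
}'
# prints: total-vm: 1276 MB
```

On a live box the same filter could be fed from `dmesg`, assuming one can log in at all, which was not the case here.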
[13:23:03] PROBLEM - puppet last run on mw1063 is CRITICAL: Connection refused by host [13:23:14] mobile apps should fall back, so they're not affected, right? [13:23:18] RECOVERY - LVS HTTP IPv4 on cxserver.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 867 bytes in 0.011 second response time [13:23:25] PROBLEM - LVS HTTP IPv4 on ores.svc.eqiad.wmnet is CRITICAL: Connection refused [13:23:28] <_joe_> paravoid: mobile apps, graphoid, etc were down [13:23:35] PROBLEM - DPKG on mw1063 is CRITICAL: Connection refused by host [13:23:37] RECOVERY - Check size of conntrack table on scb1001 is OK: OK: nf_conntrack is 23 % full [13:23:37] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [13:23:39] paravoid: for overall cache_text 5xx, pretty small elevation on part with usual temporary deployment mistakes. but likely mobileapps, cxserver, some other services were completely dead? [13:23:42] content translation is user-facing, VE would have issues because of citoid [13:23:45] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 27 minutes ago with 0 failures [13:23:45] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [13:23:46] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [13:23:46] PROBLEM - Disk space on mw1063 is CRITICAL: Connection refused by host [13:23:46] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [13:23:48] s/on part/on par/ [13:23:49] <_joe_> akosiaris: it seems ores killed all the scb* cluster machines [13:23:57] PROBLEM - MD RAID on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:24:05] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [13:24:05] RECOVERY - Mathoid LVS eqiad on mathoid.svc.eqiad.wmnet is OK: All endpoints are healthy [13:24:06] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [13:24:07] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy [13:24:07] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [13:24:09] how on earth did that happen ? [13:24:15] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [13:24:23] <_joe_> akosiaris: I counted at least 20 celery workers [13:24:23] currently the elevation is still there, but it's mostly transitioned to 504 (instead of 503 or 500), which I think is RB saying underlying service problem. [13:24:30] <_joe_> using 3% of memory each [13:24:32] RECOVERY - LVS HTTP IPv4 on citoid.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 921 bytes in 0.004 second response time [13:24:33] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [13:24:33] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [13:24:47] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [13:24:56] <_joe_> so, I guess we should powercycle the other machine (scb1002) too [13:25:01] 2002 is doing it too [13:25:02] RECOVERY - LVS HTTP IPv4 on graphoid.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 916 bytes in 0.011 second response time [13:25:02] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [13:25:03] +1 [13:25:03] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [13:25:05] i'm in 2002 now [13:25:09] ok to powercycle it? [13:25:12] <_joe_> why 2002? :P [13:25:20] <_joe_> it's the inactive dc [13:25:28] it is angry too! 
[13:25:29] [685603.898208] Killed process 120914 (nodejs) total-vm:1308512kB, anon-rss:64364kB, file-rss:1960kB [13:25:30] <_joe_> we might want to see what the hell is happening there [13:25:30] still, should stop the madness there too [13:25:31] and also alerting [13:25:33] should I stop ores as a preventive measure? [13:25:36] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [13:25:36] RECOVERY - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is OK: All endpoints are healthy [13:25:36] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [13:25:37] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [13:25:37] ok, let's leave codfw up [13:25:51] RECOVERY - LVS HTTP IPv4 on ores.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2612 bytes in 0.007 second response time [13:25:53] we can stop one of the pair in codfw and leave the other in the bad state for investigation [13:25:58] assuming we can even log in to investigate [13:26:03] memory is ok for now [13:26:06] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [13:26:09] shall I powercycle 1002? [13:26:17] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: Generic error: Timeout on connection while downloading http://mobileapps.svc.codfw.wmnet:8888/?spec [13:26:20] <_joe_> I vote yes [13:26:23] ok doing [13:26:30] !log powercycling scb1002 [13:26:32] <_joe_> akosiaris: can you take a look at the state of ores on 1001? 
[13:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:26:37] <_joe_> I think I stopped it all [13:26:50] nope, it's running [13:26:57] uwsgi is still up [13:26:59] strontium failed to pull in someone's git change 13m ago [13:27:17] but I don't see it causing problems [13:27:18] probably unrelated [13:27:26] RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy [13:27:28] <_joe_> akosiaris: I turned off celery [13:27:30] ores on scb1001 I mean [13:27:36] mobileapps on the other hand [13:27:42] RECOVERY - LVS HTTP IPv4 on mobileapps.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 960 bytes in 0.007 second response time [13:27:44] is causing 100% cpu right now [13:27:45] <_joe_> what's up with mobileapps? [13:27:45] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [13:27:45] oh it's both strontium + palladium, probably someone didn't puppet-merge because the above started happening [13:27:55] RECOVERY - Check size of conntrack table on scb2001 is OK: OK: nf_conntrack is 0 % full [13:27:56] RECOVERY - DPKG on scb2001 is OK: All packages OK [13:27:58] <_joe_> yes [13:28:02] <_joe_> mobrovac: ^^ [13:28:24] looking [13:28:26] RECOVERY - MD RAID on scb2001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [13:28:35] <_joe_> akosiaris: turn it off I'd say :) [13:28:42] mobileapps ? 
[13:29:02] mw1063 is unresponsive, not sure if related [13:29:03] sounds like a good idea now that I look at it [13:29:14] no no, better to turn cp off [13:29:15] RECOVERY - configured eth on scb2001 is OK: OK - interfaces up [13:29:15] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [13:29:29] !log stopping changeprop on scb1001 [13:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:29:35] RECOVERY - Check size of conntrack table on scb1002 is OK: OK: nf_conntrack is 0 % full [13:29:35] RECOVERY - salt-minion processes on scb1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:29:55] RECOVERY - cxserver endpoints health on scb1002 is OK: All endpoints are healthy [13:29:57] what is missing with user impact? [13:30:00] 06Operations, 10cassandra: 1000+ keyspace metrics you didn't see coming - https://phabricator.wikimedia.org/T137304#2364326 (10Eevans) [13:30:02] !log change-prop stopped on scb1002 [13:30:05] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [13:30:05] RECOVERY - dhclient process on scb1002 is OK: PROCS OK: 0 processes with command name dhclient [13:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:30:08] 06Operations, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium): / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2364339 (10hashar) [13:30:13] (03PS1) 10Giuseppe Lavagetto: role::ci::master: require role::ci::website [puppet] - 10https://gerrit.wikimedia.org/r/293304 [13:30:14] ok [13:30:16] RECOVERY - DPKG on scb1002 is OK: All packages OK [13:30:16] RECOVERY - configured eth on scb1002 is OK: OK - interfaces up [13:30:16] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 52 minutes ago with 0 failures [13:30:36] RECOVERY - mathoid endpoints 
health on scb1002 is OK: All endpoints are healthy [13:30:37] RECOVERY - SSH on scb1002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0) [13:30:41] !log disabling puppet on scb1001 & scb1002 [13:30:41] too many reqs being processed by change-prop to keep up, will lower the concurrency limit [13:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:30:52] <_joe_> jynus: mw1063 is decommissioned, I think it has to do with the wrong dhcp entries for the new appservers [13:30:56] RECOVERY - Disk space on scb1002 is OK: DISK OK [13:31:03] with changeprop down things seem to have calmed down [13:31:05] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [13:31:06] RECOVERY - MD RAID on scb1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [13:31:06] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy [13:31:12] _joe_, thanks, then let's forget about it for now [13:31:25] 06Operations, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium): / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2364341 (10Paladox) [13:31:26] RECOVERY - MegaRAID on scb1002 is OK: OK: no disks configured for RAID [13:31:27] RECOVERY - ores uWSGI web app on scb1002 is OK: ● uwsgi-ores.service - uwsgi-ores uwsgi app [13:31:28] graphoid seems now up, which was the last thing down? [13:31:52] (03CR) 10Giuseppe Lavagetto: [C: 032] role::ci::master: require role::ci::website [puppet] - 10https://gerrit.wikimedia.org/r/293304 (owner: 10Giuseppe Lavagetto) [13:32:00] (03CR) 10Giuseppe Lavagetto: [V: 032] role::ci::master: require role::ci::website [puppet] - 10https://gerrit.wikimedia.org/r/293304 (owner: 10Giuseppe Lavagetto) [13:32:05] akosiaris: since you disabled puppet, could you s/100/50/ /etc/changeprop/config.yaml on scb100x so that we see if that's enough ? 
[13:32:09] PROBLEM - LVS HTTP IPv4 on mobileapps.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:32:24] RECOVERY - ores on scb1002 is OK: HTTP OK: HTTP/1.0 200 OK - 2801 bytes in 0.021 second response time [13:32:31] <_joe_> mobrovac: also, why is changeprop killing the mobileapps in codfw too [13:32:34] <_joe_> ? [13:32:42] that's not possible [13:33:06] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2364351 (10Paladox) [13:33:09] RECOVERY - LVS HTTP IPv4 on mobileapps.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 960 bytes in 0.102 second response time [13:33:09] RECOVERY - Disk space on scb2002 is OK: DISK OK [13:33:12] mobrovac: in zk logs in codfw I see things like [13:33:13] PROBLEM - ores uWSGI web app on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:33:15] checking now phab/wikis for user reports [13:33:17] <_joe_> mobrovac: evidence is, it's happening [13:33:18] 2016-06-08 13:30:20,971 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception [13:33:18] EndOfStreamException: Unable to read additional data from client sessionid 0xd355301d8c591dc2, likely client has closed socket [13:33:18] at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220) [13:33:18] at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208) [13:33:18] at java.lang.Thread.run(Thread.java:745) [13:33:18] 2016-06-08 13:30:20,971 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client /10.192.32.132:35818 which had sessionid 0xd355301d8c591dc2 [13:33:45] k, lemme stop cp there [13:33:46] although that was 4 mins ago and nothing since [13:33:47] ok, I am thinking the zookeeper restart and changeprop going berserk are related. 
[13:34:01] ottomata: can you check the queue sizes in codfw for codfw.* topics? [13:34:15] PROBLEM - configured eth on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:34:22] ottomata: the last ZK change removed the per IP max-conns limits that changeprop was suffering from, right? [13:34:49] akosiaris: yes, but that doesn't explain scb going completely down, right before the inactivity, top showed changeprop was using <1% per worker [13:35:09] and a normal avg consumption rate [13:35:19] elukey: yes, but i doubt that change prop all of a sudden created so many more connections that would do this because of that [13:35:31] elukey: more likely it was the zk restart itself that made something angry [13:35:33] yeah, super weird [13:35:34] Mobrovac maybe it affected the backup servers, too, since the primary servers were down? [13:35:34] PROBLEM - DPKG on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:35:44] mobrovac: queue sizes? [13:35:44] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Could not fetch url http://mobileapps.svc.codfw.wmnet:8888/en.wikipedia.org/v1/page/media/Cat: Timeout on connection while downloading http://mobileapps.svc.codfw.wmnet:8888/en.wikipedia.org/v1/page/media/Cat: /{domain}/v1/page/mobile-summary/{title} (retr [13:35:56] oh like message in? [13:36:04] elukey: ottomata: yeah, it was probably the restart, otherwise the zk hosts would have suffered too [13:36:05] PROBLEM - MD RAID on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:36:06] very very few mobrovac [13:36:13] RECOVERY - configured eth on scb2001 is OK: OK - interfaces up [13:36:18] but why are there any in the first place? 
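[Editor's note] The "per IP max-conns limit" discussed above is ZooKeeper's `maxClientCnxns` setting, which caps concurrent connections from a single source IP. A hypothetical zoo.cfg fragment; the log does not show the actual values, only that the limit was lifted:

```ini
# zoo.cfg (illustrative): cap on concurrent client connections per source IP.
# 0 disables the limit entirely, which is what the referenced change
# reportedly did to stop changeprop from hitting it.
maxClientCnxns=0
```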
[13:36:37] mobrovac: just the spec tests i think [13:36:40] bearND|afk: no, the fail-over is not automatic [13:36:41] ja [13:36:46] only codfw.test.event [13:36:50] ah ok [13:36:54] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:36:55] no change prop events [13:37:23] PROBLEM - Check size of conntrack table on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:37:34] still can't log in on scb200x [13:37:46] didn't powercycle the codfw ones [13:37:57] ottomata: ok, so selecting codfw in https://grafana-admin.wikimedia.org/dashboard/db/eventbus doesn't actually work [13:38:04] (03PS1) 10Giuseppe Lavagetto: role::ci::master: hotfix for /srv/ssd [puppet] - 10https://gerrit.wikimedia.org/r/293306 [13:38:06] _joe_: suggested leaving them up to try to investigate somehow [13:38:20] <_joe_> ottomata: no need if we got the culprit [13:38:28] <_joe_> let's powercycle those as well [13:38:38] mobrovac: it works for me [13:38:53] PROBLEM - Disk space on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:38:54] PROBLEM - Disk space on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:38:55] i guess it only changes the kafka message / sec [13:39:09] the event bus post stuff is aggregated statsd, no cluster name in the metrics [13:39:29] dunno about any change prop stuff [13:39:29] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] role::ci::master: hotfix for /srv/ssd [puppet] - 10https://gerrit.wikimedia.org/r/293306 (owner: 10Giuseppe Lavagetto) [13:39:37] we prob should put change prop stuff on its own dash [13:40:24] is ores running on scb100x now? [13:40:25] ok, mobrovac shall I powercycle 2001 and 2002? 
[13:40:36] yes, please, no use of them like this [13:40:38] (03CR) 10Eevans: "Other than parameterizing the contacts, I don't have any ideas. Is there an alternative to encapsulating the check in the module, and if " [puppet] - 10https://gerrit.wikimedia.org/r/292568 (https://phabricator.wikimedia.org/T135145) (owner: 10Elukey) [13:41:51] ottomata: do you need any help with pc? [13:41:59] I can take care of 2002 if you want [13:42:00] !log powercycling scb2001 and scb2002 [13:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:42:05] ah, s'ok elukey just did [13:42:09] super [13:42:26] (03PS1) 10Mobrovac: Change Prop: Lower the concurrency limit to 30 [puppet] - 10https://gerrit.wikimedia.org/r/293307 [13:42:41] akosiaris: _joe_: ^ let's go with this limit [13:42:55] i got yo quick merge mobrovac [13:43:15] (03CR) 10Ottomata: [C: 032 V: 032] Change Prop: Lower the concurrency limit to 30 [puppet] - 10https://gerrit.wikimedia.org/r/293307 (owner: 10Mobrovac) [13:43:32] akosiaris: will a puppet run bring ores back? 
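[Editor's note] The manual hotfix akosiaris was asked to apply (and the later Gerrit patch) amounts to editing a value in /etc/changeprop/config.yaml while puppet is disabled. A sketch of that edit, demonstrated on a throwaway copy so it is runnable anywhere; the key name `concurrency` is an assumption, since the log only shows the values 100 → 50 → 30:

```shell
# Demonstrate the s/100/30/ hotfix on a temp file; on scb100x the target
# would be /etc/changeprop/config.yaml (requires GNU sed for -i).
cfg=$(mktemp)
printf 'concurrency: 100\n' > "$cfg"
sed -i 's/^concurrency: 100$/concurrency: 30/' "$cfg"
cat "$cfg"
# prints: concurrency: 30
rm -f "$cfg"
```

With puppet disabled on scb100x, the merged patch only takes effect once puppet is re-enabled and run (or the file is edited by hand as above), followed by a changeprop restart.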
[13:43:33] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=mobileapps.svc.codfw.wmnet, port=8888): Max retries exceeded with url: /?spec (Caused by class socket.error: [Errno 113] No route to host) [13:43:55] RECOVERY - MD RAID on scb2001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [13:44:13] RECOVERY - SSH on scb2002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0) [13:44:13] RECOVERY - SSH on scb2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0) [13:44:20] RECOVERY - LVS HTTP IPv4 on cxserver.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 867 bytes in 0.095 second response time [13:44:20] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [13:44:21] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy [13:44:23] need coffee, gonna walk to cafe, be back shortly [13:44:25] !log running fsck.ext3 /dev/sda2 in read-write mode for gallium [13:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:44:34] RECOVERY - puppet last run on scb2002 is OK: OK: Puppet is currently enabled, last run 31 minutes ago with 0 failures [13:44:34] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [13:44:34] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [13:44:40] RECOVERY - LVS HTTP IPv4 on graphoid.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 916 bytes in 0.077 second response time [13:44:43] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [13:44:54] RECOVERY - Disk space on scb2002 is OK: DISK OK [13:44:54] RECOVERY - Disk space on scb2001 is OK: DISK OK [13:44:55] RECOVERY - configured eth on scb2002 is OK: OK - interfaces up [13:44:55] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy 
[13:44:55] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [13:45:03] RECOVERY - cxserver endpoints health on scb2001 is OK: All endpoints are healthy [13:45:09] RECOVERY - LVS HTTP IPv4 on mathoid.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 925 bytes in 0.094 second response time [13:45:10] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [13:45:10] RECOVERY - dhclient process on scb2001 is OK: PROCS OK: 0 processes with command name dhclient [13:45:10] RECOVERY - puppet last run on scb2001 is OK: OK: Puppet is currently enabled, last run 47 minutes ago with 0 failures [13:45:13] RECOVERY - cxserver endpoints health on scb2002 is OK: All endpoints are healthy [13:45:20] RECOVERY - LVS HTTP IPv4 on citoid.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 921 bytes in 0.093 second response time [13:45:33] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [13:45:34] RECOVERY - salt-minion processes on scb2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:45:34] RECOVERY - Check size of conntrack table on scb2001 is OK: OK: nf_conntrack is 0 % full [13:45:34] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [13:45:35] RECOVERY - salt-minion processes on scb2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:45:35] RECOVERY - dhclient process on scb2002 is OK: PROCS OK: 0 processes with command name dhclient [13:45:35] RECOVERY - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is OK: All endpoints are healthy [13:45:43] RECOVERY - mathoid endpoints health on scb2002 is OK: All endpoints are healthy [13:45:43] RECOVERY - graphoid endpoints health on scb2001 is OK: All endpoints are healthy [13:45:43] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [13:45:43] RECOVERY - citoid endpoints health on scb2001 
is OK: All endpoints are healthy [13:45:44] RECOVERY - DPKG on scb2001 is OK: All packages OK [13:45:44] RECOVERY - Check size of conntrack table on scb2002 is OK: OK: nf_conntrack is 0 % full [13:45:45] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [13:45:53] akosiaris: ping? [13:45:53] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [13:46:03] RECOVERY - DPKG on scb2002 is OK: All packages OK [13:46:04] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [13:46:04] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [13:46:05] RECOVERY - mathoid endpoints health on scb2001 is OK: All endpoints are healthy [13:46:13] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [13:46:23] RECOVERY - MD RAID on scb2002 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [13:46:23] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [13:46:33] RECOVERY - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is OK: All endpoints are healthy [13:46:33] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [13:46:33] RECOVERY - graphoid endpoints health on scb2002 is OK: All endpoints are healthy [13:46:34] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [13:46:43] PROBLEM - Host mw1063 is DOWN: PING CRITICAL - Packet loss = 100% [13:46:44] RECOVERY - ores on scb2001 is OK: HTTP OK: HTTP/1.0 200 OK - 2801 bytes in 0.127 second response time [13:47:21] when did the outage start? 
[13:47:48] I am seeing a huge amount of traffic coming through conf100[123] starting ~40 minutes ago [13:47:55] and now ok [13:48:03] RECOVERY - changeprop endpoints health on scb2002 is OK: All endpoints are healthy [13:48:05] RECOVERY - ores uWSGI web app on scb2002 is OK: ● uwsgi-ores.service - uwsgi-ores uwsgi app [13:48:14] (03PS1) 10Giuseppe Lavagetto: role::ci::website: require role::ci::slave [puppet] - 10https://gerrit.wikimedia.org/r/293308 [13:48:19] https://grafana.wikimedia.org/dashboard/db/server-board -> conf1001 [13:48:24] RECOVERY - ores on scb2002 is OK: HTTP OK: HTTP/1.0 200 OK - 2801 bytes in 0.088 second response time [13:48:38] <_joe_> those hosts also keep our etcd cluster [13:48:46] <_joe_> should I move it away? [13:48:53] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [13:48:56] <_joe_> I was reassured zk is not doing too much traffic [13:49:12] _joe_ it is worth to think about it [13:49:15] (03PS2) 10Giuseppe Lavagetto: role::ci::website: require role::ci::slave [puppet] - 10https://gerrit.wikimedia.org/r/293308 [13:50:05] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [13:50:06] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] role::ci::website: require role::ci::slave [puppet] - 10https://gerrit.wikimedia.org/r/293308 (owner: 10Giuseppe Lavagetto) [13:51:10] _joe_: IMHO for our confd/etcd stuff that drives service-state, it probably should be elsewhere. I think at one point in the distant past, I was thinking it could be co-located with the LVS servers themselves. [13:51:27] (as in deploy etcd servers on redundant pairs of LVSes in each DC) [13:51:42] but then that gets tricky with restricting access and all that [13:52:20] so LVS isn't a great home either. but it would be nice for it to be a bit isolated from other general things that could fail or saturate [13:52:43] akosiaris: am i ok to run puppet on scb? 
[13:52:44] (since we potentially need it to do manual depools when other things are broken) [13:53:46] (03PS1) 10Giuseppe Lavagetto: role::ci::slave: require /srv/ssd as a file, not mountpoint [puppet] - 10https://gerrit.wikimedia.org/r/293309 [13:54:08] mobrovac: https://graphite.wikimedia.org/S/BZ [13:54:17] (03PS2) 10Giuseppe Lavagetto: role::ci::slave: require /srv/ssd as a file, not mountpoint [puppet] - 10https://gerrit.wikimedia.org/r/293309 [13:55:32] right after the zk restarts tons of traffic landed to conf100[123] [13:55:35] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] role::ci::slave: require /srv/ssd as a file, not mountpoint [puppet] - 10https://gerrit.wikimedia.org/r/293309 (owner: 10Giuseppe Lavagetto) [13:55:55] but I am not sure what was that stuff [13:56:25] (03CR) 10Eevans: "See also: https://phabricator.wikimedia.org/T137304 (for cleaning up the extra metrcs)" [puppet] - 10https://gerrit.wikimedia.org/r/290860 (owner: 10Eevans) [13:56:29] might have been connections stuck accumulating [13:56:47] but those metrics are traffic bits [13:57:03] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [13:57:44] elukey: hm, changeprop had a spike between 13:20 and 13:30 utc, whereas zk shows spikes before that [13:58:04] yeah [13:58:15] right after the merge [13:58:16] (03PS1) 10Krinkle: Bump wgResourceLoaderStorageVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293310 [13:58:38] elukey: possibly the merge causing uncoordinated zk restarts? [13:58:46] when i applied the patch in beta, puppet restarted zk [13:58:57] which is probably not a good idea to begin with [13:59:09] let me check puppet [13:59:28] damn it akosiaris, where are you? 
[14:00:06] yeah it has hasrestart => true [14:00:18] with subscribe to File['/etc/zookeeper/conf/zoo.cfg'], [14:00:19] i see ores running on scb, but puppet is disabled [14:00:26] and i need to run it for cp [14:00:45] elukey: uh, very bad, if you ask me [14:01:18] we also don't have a lot of metrics about zk too (I mean service ones) [14:01:20] that could explain those spikes - three spikes, three zk restarts [14:01:35] corresponding to puppet runs [14:01:39] let's check syslog [14:01:50] actually, there are 5 of them - 3 for the puppet runs and 2 for otto restarting by hand [14:02:54] on conf1001 [14:02:55] Jun 8 13:00:41 conf1001 puppet-agent[10027]: (/Stage[main]/Zookeeper::Server/Service[zookeeper]) Triggered 'refresh' from 1 events [14:04:04] conf1002 [14:04:04] Jun 8 13:00:35 conf1002 puppet-agent[24349]: (/Stage[main]/Zookeeper::Server/Service[zookeeper]) Triggered 'refresh' from 1 events [14:05:23] _joe_ just as FYI, I am seeing a lot of these [14:05:23] PROBLEM - jenkins_zmq_publisher on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 8888: Connection refused [14:05:24] conf1002 etcd[18215]: auth: no authorization provided, checking guest access [14:05:38] in syslog.. might be nothing but I wanted to let you know [14:05:52] elukey: zk is restarting on puppet run now? [14:06:02] refresh? [14:06:16] hmmm [14:06:34] oh man [14:07:19] oh no that's from my run earlier [14:07:20] hm [14:07:45] mobrovac, i wonder if things got really unhappy because I ran puppet on all 3 nodes at the same time! I didn't realize/remember that zk subscribed its configs...
[14:07:46] that is not good [14:07:49] i'm fixing that now [14:09:04] i'd say highly likely ottomata [14:09:53] I think that we moved from zk used only by hadoop to kafka and finally change prop, without properly reviewing the service (that is now REALLY important) [14:10:43] elukey: zk is used by kafka, which is why cp needs it [14:10:50] kafka first, but ja [14:10:59] if cp was using newer kafka client, it wouldnt' need it :p [14:11:21] mobrovac: yes sorry I didn't specify :) [14:11:22] ottomata: it's no time for trolling now, seriously [14:11:48] 06Operations, 10Phabricator, 06Project-Admins, 06Triagers: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#2364449 (10Danny_B) @mmodell Re the topic above: Can actually creating of milestones be separated right from creating projects? [14:12:00] mobrovac: what about if we try to write down a timeline? [14:12:34] elukey: sure, but let's discuss it after we have solved the problem at hand first [14:13:09] (03PS1) 10Ottomata: Remove subscribe from zookeeper server [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/293312 [14:14:22] * mobrovac still pinging akosiaris [14:15:21] (03CR) 10Ottomata: [C: 032 V: 032] Remove subscribe from zookeeper server [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/293312 (owner: 10Ottomata) [14:15:57] so are we still unsure if ores deploy is related? [14:16:00] or is that ruled out? [14:16:07] we dunno [14:16:13] akosiaris doesn't seem to be around [14:16:26] he deployed it? [14:16:37] that's what _joe_ said [14:16:37] ottomata: https://graphite.wikimedia.org/S/BZ [14:17:04] interesting. 
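[editor's note] The `subscribe`/`hasrestart` behavior discussed above, and the fix in the "Remove subscribe from zookeeper server" patch, amount to roughly this pattern (a minimal sketch based on the resource names quoted in the log; the real puppet/zookeeper module has more parameters):

```puppet
# Before: any change to zoo.cfg refreshes the service, i.e. restarts it
# (hasrestart => true). A fleet-wide puppet run therefore restarted all
# three zookeepers at nearly the same time.
service { 'zookeeper':
    ensure     => running,
    hasrestart => true,
    subscribe  => File['/etc/zookeeper/conf/zoo.cfg'],
}

# After: puppet still orders the service after the config file, but no
# longer restarts it on config changes; operators restart one node at a
# time by hand.
service { 'zookeeper':
    ensure     => running,
    hasrestart => true,
    require    => File['/etc/zookeeper/conf/zoo.cfg'],
}
```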
[14:17:28] there were lots of expired sessions and disconnected clients when i was restarting zk and watching logs for a while [14:17:58] hm so [14:18:14] <_joe_> elukey: that's ok [14:18:19] <_joe_> the etcd message [14:18:24] super [14:18:28] 06Operations, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium): / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2364497 (10Paladox) [14:18:48] 1. probably all 3 zks were restarted at the same time by puppet - that was bad [14:18:48] 2. change-prop should not kill everything if it can't talk to zookeeper (well) [14:18:54] (03PS1) 10Ottomata: Update zookeeper submodule with change to no longer subscribe to config files [puppet] - 10https://gerrit.wikimedia.org/r/293315 [14:19:01] elukey: 1. is probably why the rm on analytics1001 died [14:19:02] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:19:12] yeah [14:19:20] analytics1001 was the first one to fall [14:19:41] aye, but thanks to your auto failover work, 1002 just stepped right in :) [14:19:56] i think 1002 is still the active rm [14:20:21] well we can switch it back with a simple restart later on [14:20:27] or maybe with your commands [14:20:42] i think restart is the better way [14:20:44] but ja [14:20:52] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.022 second response time [14:20:55] for rm [14:20:56] but ja [14:21:04] mobrovac: I know that you are waiting alex but can we do something at the moment? [14:21:15] you are waiting to run puppet right? [14:21:36] elukey: not sure, alex disabled puppet on scb, and we need to run it [14:21:41] elukey: oh, there is! 
[14:22:14] elukey: someone with root on scb could manually change the config, and then we can bring cp up on one node [14:22:21] (03PS1) 10BBlack: ssl sid cache: 15m -> 75m [puppet] - 10https://gerrit.wikimedia.org/r/293316 [14:22:57] mmmmmmmmm [14:23:10] ? [14:23:23] it was a thought [14:23:50] it would be super great to use puppet atm, without doing anything risky.. even if it sounds the only thing to do [14:24:18] elukey: the idea is to manually change the config just as puppet would [14:24:25] since we're blocked on puppet being disabled [14:24:47] (03CR) 10BBlack: [C: 032 V: 032] ssl sid cache: 15m -> 75m [puppet] - 10https://gerrit.wikimedia.org/r/293316 (owner: 10BBlack) [14:25:52] mobrovac: can we do it on codfw first? [14:26:11] elukey: that would do us no good - there are no messages being produced there ... [14:26:29] yeah but just to check [14:26:56] (03PS2) 10Ottomata: Update zookeeper submodule with change to no longer subscribe to config files [puppet] - 10https://gerrit.wikimedia.org/r/293315 [14:27:01] (03PS1) 10Giuseppe Lavagetto: role::contint::*: try to resolve circular dependencies correctly [puppet] - 10https://gerrit.wikimedia.org/r/293319 [14:27:04] (03CR) 10Ottomata: [C: 032 V: 032] Update zookeeper submodule with change to no longer subscribe to config files [puppet] - 10https://gerrit.wikimedia.org/r/293315 (owner: 10Ottomata) [14:27:33] elukey: sure, let's try [14:27:53] scb2001.codfw.wmnet ? 
[14:28:03] elukey: sure, why not [14:28:43] and I guess /etc/changeprop/config.yaml [14:28:44] elukey: oh actually no need for manual intervention there [14:28:49] puppet ran there [14:28:55] ah yes [14:28:58] it hasn't been disabled in codfw [14:29:05] ok, i'll start cp on scb2001 [14:29:52] lots of logs in codfw zk [14:30:09] things like [14:30:10] 2016-06-08 14:29:44,015 - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0xd355301d8c595990, timeout of 30000ms exceeded [14:30:26] 2016-06-08 14:30:14,078 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating client: 0xd155301d8d5e56c5 [14:30:26] 2016-06-08 14:30:14,078 - INFO [QuorumPeer[myid=2003]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@588] - Invalid session 0xd155301d8d5e56c5 for client /10.192.48.43:38032, probably expired [14:30:37] hmm, looks like cp is running there [14:30:41] i'll restart it [14:30:47] 2016-06-08 14:29:45,189 - INFO [ProcessThread(sid:2002 cport:-1)::PrepRequestProcessor@627] - Got user-level KeeperException when processing sessionid:0xd355301d8c595aaa type:create cxid:0x1a zxid:0x300061977 txntype:-1 reqpath:n/a Error Path:/kafka/main-codfw/consumers/change-prop-change-prop.retry.mediawiki.revision_visibility_set-revision_visibility_change/ids Error:KeeperErrorCode = NodeExists for /kafka/main-codfw/consumers/change [14:32:41] (brb) [14:32:57] (03PS2) 10Giuseppe Lavagetto: role::contint::*: try to resolve circular dependencies correctly [puppet] - 10https://gerrit.wikimedia.org/r/293319 [14:33:48] elukey: ottomata: it seems like cp can't connect to zk in codfw [14:33:55] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] role::contint::*: try to resolve circular dependencies correctly [puppet] - 10https://gerrit.wikimedia.org/r/293319 (owner: 10Giuseppe Lavagetto) [14:34:09] mobrovac: yeah something about invalid sessions? [14:34:21] a restart of cp didn't help [14:34:35] i think cp is trying to access znodes in zk that don't exist? 
[14:35:16] Error:KeeperErrorCode = NodeExists for /kafka/main-codfw/consumers [14:35:17] hmmm [14:35:23] looking in zk [14:36:42] mobrovac: something is def strange [14:36:45] those znodes exist [14:36:54] oh sorry [14:36:56] NodeExists! [14:37:00] dunno why I read that as no node exists [14:37:04] yeah, that's ok] [14:37:09] that's ok? [14:37:09] (03PS2) 10Krinkle: zuul.eqiad.wmnet is no more of any use [dns] - 10https://gerrit.wikimedia.org/r/293288 (https://phabricator.wikimedia.org/T137265) (owner: 10Hashar) [14:37:15] ok. [14:37:21] lots of CONNECTION_LOSS on the cp end though [14:37:28] but some managed to connect [14:37:30] mobrovac: i was going to guess that cp was getting exception when trying to create a znode that exists or something [14:37:35] ya hm [14:37:37] (03CR) 10Krinkle: "(Don't merge before I0a4722b509cfe76e is merged or similar patch that removes the reference)" [dns] - 10https://gerrit.wikimedia.org/r/293288 (https://phabricator.wikimedia.org/T137265) (owner: 10Hashar) [14:38:04] o/ Hey folks. Just saw all the pings re. ORES in scb. [14:38:12] * halfak needs to set up paging to make it to his phone [14:38:21] I have it for the labs deploy, but not SCB yet [14:38:22] (03PS2) 10BBlack: tlsproxy: turn proxy_request_buffering off [puppet] - 10https://gerrit.wikimedia.org/r/287996 [14:38:24] Anything I can do? [14:38:51] mobrovac: https://gist.github.com/ottomata/7021f99132f44ec329d7835c96ccdfa6 [14:39:23] halfak: an ores deploy and a zookeeper restart happened at about the same time, and then after that there was a service cluster outage [14:39:29] we think it probably isn't related to the ores deploy [14:39:40] we aren't certain but prob isn't [14:39:46] OK great. Was going to be surprised that ORES hurt anything else. [14:39:56] Will be around for the next hour at least. [14:40:14] ottomata: any news? 
[14:40:14] Let me know if you'd like to me to look into anything in particular [14:40:18] mobrovac: i dunno what the renew / expire session is about [14:40:30] me neither [14:40:37] akosiaris: mobrovac started changeprop in codfw, and it doesn't look happy [14:40:54] better to see it unhappy in there [14:40:59] :) [14:41:20] ok, so we 've narrowed it down to changeprop [14:41:20] ottomata: elukey: i wouldn't say it doesn't look happy [14:41:26] no ottomata [14:41:27] probably at least [14:41:36] (03PS1) 10Giuseppe Lavagetto: jenkins: declare the home directory explicitly [puppet] - 10https://gerrit.wikimedia.org/r/293321 [14:41:52] <_joe_> hashar: ^^ pls confirm that's correct [14:42:07] ottomata: as i said, these errors happen in cp, caused by workers trying to steal work from each other [14:42:17] <_joe_> reason is, you declare a .gitconfig in role::ci::slave which assumes that [14:42:33] akosiaris_: Marko would like to run puppet on scb eqiad to add https://gerrit.wikimedia.org/r/#/c/293307/ [14:42:36] akosiaris_: are you doing anything on scb? why is puppet disabled there? [14:42:42] exactly [14:42:59] mobrovac: i thought you said cp in codfw coudlnt' connect to zk? [14:43:02] _joe_: looking [14:43:05] or, at least, conns were dying [14:43:32] (03PS2) 10Giuseppe Lavagetto: jenkins: declare the home directory explicitly [puppet] - 10https://gerrit.wikimedia.org/r/293321 [14:43:32] mobrovac: I disabled it as a precaution [14:43:47] feel free to enable [14:43:51] ottomata: there was a lot of noise, but the restart seems to have stabilised it [14:43:54] kk akosiaris_ [14:43:55] oh ok [14:44:21] !log scb1001 enabling and running puppet on scb1001 [14:44:24] elukey: yeah ok on my part [14:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:44:53] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 2 failures [14:45:12] akosiaris_: shall we re-enable puppet on one of the scb eqiad host? 
[14:45:25] ah yes just seen your comments [14:45:32] (03CR) 10Hashar: "Yup it is fine. Only used on production, I guess I wrote the puppet manifest after jenkins user got already created ages ago." [puppet] - 10https://gerrit.wikimedia.org/r/293321 (owner: 10Giuseppe Lavagetto) [14:45:33] ok, lots of those same logs rolling in in eqiad zk [14:45:33] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [14:45:47] _joe_: https://gerrit.wikimedia.org/r/#/c/293321/ yeah good to go [14:45:56] _joe_: we had Jenkins installed first (and thus jenkins user) then later added puppet recipes [14:46:02] ok, here we go, let's see [14:46:13] 06Operations, 06Labs: Changing username on WikiTech - https://phabricator.wikimedia.org/T137315#2364589 (10Soni) [14:46:16] so /var/lib/jenkins has always been around. Though in theory one could require the package['jenkins'] [14:46:19] so, ok mobrovac so the NodeExists stuff is normal for cp starting up, as it is balancing workings? [14:46:20] workers* [14:46:37] yes [14:46:39] k [14:46:50] makes sense [14:46:53] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [14:46:57] _joe_: err forget me.. that is jenkins-slave user .. I have no idea how we had it created maybe manuall [14:47:04] would maybe be better to check if it exists before attempting to create? i guess..that is, if you are doing it manually [14:47:08] <_joe_> hashar: you should be able to log into contint1001 [14:47:10] could be kafka client doing it i guess [14:47:24] _joe_: fails somehow :( [14:47:37] <_joe_> hashar: let me check [14:47:49] 06Operations, 06Labs, 10wikitech.wikimedia.org: Changing username on WikiTech - https://phabricator.wikimedia.org/T137315#2364602 (10Peachey88) [14:48:18] _joe_: my ssh pub key should have the description " hashar@postwater_wmf " [14:48:18] stopped cp on scb1001 [14:48:32] no good mobrovac? 
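[editor's note] The benign NodeExists errors discussed above ("normal for cp starting up, as it is balancing workers") are the classic create-if-absent race. A toy in-memory simulation of that race (this is an illustration, not the real zookeeper client API; the path is abridged from the log):

```python
class NodeExistsError(Exception):
    """Mirrors zookeeper's KeeperErrorCode = NodeExists."""

class ZnodeStore:
    """Toy stand-in for a zookeeper namespace."""
    def __init__(self):
        self.nodes = set()

    def create(self, path):
        if path in self.nodes:
            raise NodeExistsError(path)
        self.nodes.add(path)

def ensure_node(store, path):
    """Create-if-absent: every worker tries to create the consumer
    path at startup; losing the race is expected and harmless."""
    try:
        store.create(path)
        return "created"
    except NodeExistsError:
        return "already-exists"  # benign: another worker won the race

store = ZnodeStore()
path = "/kafka/main-codfw/consumers/change-prop/ids"
results = [ensure_node(store, path) for _ in range(3)]
print(results)  # → ['created', 'already-exists', 'already-exists']
```

The point of the exchange in the log is exactly this: the server-side KeeperException is logged at INFO, but the client treats it as success.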
[14:48:48] <_joe_> hashar: ahaha, lol [14:48:49] 06Operations, 06Labs, 10wikitech.wikimedia.org: Changing username on LDAP - https://phabricator.wikimedia.org/T137315#2364603 (10Dereckson) [14:48:52] <_joe_> found the issue [14:48:56] no, as soon as all the workers came up, cpu was back to 95%, ottomata [14:49:18] 06Operations, 06Labs, 10wikitech.wikimedia.org, 07LDAP: Changing username on LDAP - https://phabricator.wikimedia.org/T137315#2364589 (10Dereckson) [14:49:39] (03PS1) 10Giuseppe Lavagetto: hiera: rename host file [puppet] - 10https://gerrit.wikimedia.org/r/293322 [14:49:41] hm, ok mobrovac, is it because cp has a big backlog now? [14:49:42] ottomata: i'll create a patch to limit zk no cons to 200 or so just to see if that is related [14:49:45] <_joe_> hashar: ^^ [14:49:48] ok [14:49:59] mobrovac: and i have merged the do not subscribe zk patch [14:50:05] so that shouldn't bite us again [14:50:12] gr8 [14:50:15] _joe_: :-D [14:50:18] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] jenkins: declare the home directory explicitly [puppet] - 10https://gerrit.wikimedia.org/r/293321 (owner: 10Giuseppe Lavagetto) [14:50:37] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] hiera: rename host file [puppet] - 10https://gerrit.wikimedia.org/r/293322 (owner: 10Giuseppe Lavagetto) [14:51:05] I was not paying attention when I created that file bah [14:51:53] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [14:53:17] !log rebooting gallium with netboot for hardware maintenance [14:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:53:32] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [14:54:07] (03PS2) 10Elukey: Limit the maximum broker topic log size to 10TB. 
[puppet] - 10https://gerrit.wikimedia.org/r/293270 (https://phabricator.wikimedia.org/T136690) [14:56:06] (03PS1) 10Mobrovac: Zookeeper: Limit the number of connections to 200 in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/293323 [14:56:11] ottomata: ^^ [14:57:16] just for eqiad mobrovac? [14:57:22] 06Operations, 10Phabricator, 06Project-Admins, 06Triagers: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#1596769 (10Lea_WMDE) Could you please add me to #project-admins? I'm product manager of the WMDE's #tcb-team and as as such it would be g... [14:57:24] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [14:57:32] what? ^? [14:57:37] ottomata: yes, for now just to test [14:57:38] haha [14:57:51] !log rolling out the new Varnishkafka version in cache misc (didn't do it before since there was an outage ongoing) [14:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:58:05] mobrovac: did it settle? [14:58:15] stopped it [14:58:18] oh k [14:58:27] all endpoints are healthy! :p [14:58:32] haha [14:58:49] 06Operations, 10Traffic, 06Community-Liaisons (Jul-Sep-2016): Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2364614 (10Whatamidoing-WMF) [14:59:05] (03PS2) 10Ottomata: Zookeeper: Limit the number of connections to 200 in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/293323 (owner: 10Mobrovac) [14:59:25] (03CR) 10Ottomata: [C: 032 V: 032] Zookeeper: Limit the number of connections to 200 in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/293323 (owner: 10Mobrovac) [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160608T1500). Please do the needful. 
[15:00:04] yurik: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:01:03] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: puppet fail [15:01:23] hm, mobrovac that didn't have any effect [15:01:28] I'm cancelling SWAT today since CI is down. Also just one patch. [15:01:43] mobrovac: i'll try [15:02:32] try what? [15:02:55] a diff patch, that hiera didn't change anything ;? [15:03:10] <_joe_> !log contint1001: systemctl mask zuul,zuul-merger [15:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:03:53] (03PS1) 10Giuseppe Lavagetto: contint1001: add zuul [puppet] - 10https://gerrit.wikimedia.org/r/293324 (https://phabricator.wikimedia.org/T137265) [15:04:07] (03PS2) 10Giuseppe Lavagetto: contint1001: add zuul [puppet] - 10https://gerrit.wikimedia.org/r/293324 (https://phabricator.wikimedia.org/T137265) [15:04:15] ottomata: hm probably needs changes in the zk module [15:04:38] zk::server should explicitly include zk and change that param [15:04:47] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:05:16] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [15:05:23] naw the zookeeper role explicitly includes it [15:05:49] (03PS3) 10Giuseppe Lavagetto: contint1001: add zuul [puppet] - 10https://gerrit.wikimedia.org/r/293324 (https://phabricator.wikimedia.org/T137265) [15:06:10] ottomata: i don't see that in role::zk::server [15:06:23] it is in ::client [15:06:25] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] contint1001: add zuul [puppet] - 10https://gerrit.wikimedia.org/r/293324 (https://phabricator.wikimedia.org/T137265) (owner: 10Giuseppe Lavagetto) [15:06:29] ah zk::client [15:06:48] ottomata: kk, 
i can change there [15:06:52] mobrovac: patching now [15:06:57] kk [15:07:07] you had the right var, no? [15:07:08] ja [15:07:09] ^^^ fixing lutetium puppet [15:07:11] dunno why that din't wokr [15:07:15] i'm putting it in eqiad/zookeeper.yal [15:07:17] yaml [15:07:24] git review is taking fOReverrrrr [15:07:28] (03PS1) 10Ottomata: Move zookeeper max_client_connections to zookeeper.yaml in eqiad for testing [puppet] - 10https://gerrit.wikimedia.org/r/293325 [15:07:31] there it goes [15:07:47] back [15:07:51] (03PS2) 10Ottomata: Move zookeeper max_client_connections to zookeeper.yaml in eqiad for testing [puppet] - 10https://gerrit.wikimedia.org/r/293325 [15:07:59] (03CR) 10Ottomata: [C: 032 V: 032] Move zookeeper max_client_connections to zookeeper.yaml in eqiad for testing [puppet] - 10https://gerrit.wikimedia.org/r/293325 (owner: 10Ottomata) [15:08:09] anomie, are you swating? [15:08:22] yurik: I'm not [15:08:29] do you know who is? [15:08:41] yurik: I'm cancelling SWAT today since CI is down. Also just one patch. [15:08:46] that worked [15:08:57] bummer :( [15:09:08] yurik: yeah, sorry :( [15:09:09] ottomata: cool! [15:09:10] brb [15:09:10] thcipriani, any hope to get that one little patch in? :) [15:09:27] its breaking WV maps :( [15:09:33] "what does this red button do" [15:10:01] thcipriani, should we aim for the later swat today then? [15:10:19] or should i depl it myself midday during some window? [15:10:38] mobrovac: https://www.youtube.com/watch?v=knLzNk0QeoQ [15:10:51] ottomata: lmk when you have restarted zk [15:11:33] yurik: Lots of folks working on CI, hopefully it will be back shortly. I can do a quick swat deploy before the train if you'll be around then. [15:12:11] thcipriani, could you depl it even if i'm not around? I am not at home atm. 
Its a very minor and simple depl, and won't make anything worse :) [15:12:39] b [15:12:45] mobrovac: disregard that link, i didn't realize that someone had played games with the ending [15:12:52] !log restarting zookeeper 1 by 1 in eqiad [15:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:00] yurik: in this instance, sure, patch seems fairly innocuous. How do I test once deployed? [15:13:05] urandom: yeah, at the end i was like wth? [15:13:09] urandom: I just wanted to tell you about that [15:13:48] thcipriani, https://en.wikivoyage.org/wiki/User:Yurik/Sandbox/Salzburg#Get_around -- the colors of the boxes in the text should change [15:13:51] i was looking for the clip from the cartoon, and had made it most of the way through before pasting it [15:13:53] see 1,2,3 [15:14:11] urandom: https://www.youtube.com/watch?v=kfj0sRiEqx4 :) [15:14:14] 06Operations, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2364695 (10hashar) @joe got a new server, did a nice partition schema based on lvm. Had to poli... [15:15:04] mobrovac: that's a different sort of horrific :) [15:15:25] :P [15:15:53] ottomata: done? [15:16:39] yurik: ack, sounds good. I'll make a note on the deployment page. [15:16:46] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:16:55] thcipriani, you might need to resave the page [15:17:05] 0-asve [15:17:11] 0-save [15:17:14] mobrovac: zk restarted [15:17:24] kk, starting changeprop back up [15:17:33] fingers crossed [15:17:44] * urandom lights a black candle [15:18:36] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [15:18:45] \o/ [15:18:48] i like the look of that [15:19:02] ah ha! [15:19:04] mobrovac: ! 
[15:19:04] 2016-06-08 15:19:00,271 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /10.64.0.16 - max is 200 [15:19:32] let's all remember to listen to elukey's words of caution. hahhaha [15:19:54] yuhuu [15:20:16] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [15:20:39] hm, ok, so for some reason connecting to zk brings too heavy a toll on cp [15:20:57] during start-up cpu went up to 35%, but now it has settled back to 7% [15:21:30] still lots of too many connection errors [15:21:34] pretty much constant now [15:21:35] many per second [15:22:04] y so many conns? :) [15:23:02] ottomata: in eqiad? and they keep flowing in? [15:23:15] yes [15:23:27] strangely not on conf1002 [15:23:30] but on both 1001 and 1003 [15:23:32] trying to reconnect [15:23:40] from the same ip i guess? [15:23:47] yes .16 [15:23:52] 1001 [15:23:55] scb1001 [15:24:31] we definitely need more conns then 200, maybe internally zk client tries to reconnect silently? [15:25:41] most likely Pchelolo, since the worker doesn't die or nothing [15:25:57] lemme look at the code [15:26:41] ottomata: have you restarted codfw too? [15:26:53] its really blasting logs, 50/s [15:27:06] mobrovac: that change only applied to eqiad [15:27:12] so no [15:28:38] kafka-node silently reconnects ZK indefinitely - that's why 'too many connections' logs [15:29:45] Pchelolo: we should definitely limit that to 10 or so and with geometric progression intervals [15:30:41] 06Operations, 13Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2194449 (10MoritzMuehlenhoff) apt-xapian-index was throwing errors on random hosts. Since it's entirely unused (it was only installed on approx 350 systems which were installed at a time whe... [15:31:53] mobrovac: can we just set a higher hard limit right now? would that calm things? [15:31:57] 1024? 
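[editor's note] The mitigation Pchelolo sketches above — cap the reconnect attempts ("10 or so") and space them with geometric progression intervals instead of reconnecting indefinitely — could look like this (pure illustration; kafka-node's actual retry options are not shown here):

```python
import random

def reconnect_delays(max_attempts=10, base=0.5, factor=2.0, cap=30.0):
    """Yield one delay (seconds) per reconnect attempt: geometric
    growth up to a cap, plus a little jitter so a fleet of workers
    does not reconnect in lockstep and trip the server's per-IP
    connection limit."""
    delay = base
    for _attempt in range(max_attempts):
        yield min(delay, cap) * (1.0 + random.uniform(0, 0.1))
        delay *= factor

# A client driven by this generator gives up after max_attempts
# instead of hammering zookeeper forever.
for i, d in enumerate(reconnect_delays(max_attempts=6), start=1):
    print(f"attempt {i}: sleep {d:.2f}s")
```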
[15:32:07] 06Operations, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2364702 (10hashar) Status: @jcrespo has taken backups and is dealing with the disk failure + RA... [15:32:34] ottomata: yeah, let's try that [15:32:39] ok on it [15:32:43] (03CR) 10Paladox: [C: 031] contint: cleanup gallium / use contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/293283 (https://phabricator.wikimedia.org/T137265) (owner: 10Hashar) [15:32:44] gonna set it for codfw too [15:33:40] (03CR) 10Hashar: [C: 04-1] "Pending potential restauration of the service on gallium." [puppet] - 10https://gerrit.wikimedia.org/r/293283 (https://phabricator.wikimedia.org/T137265) (owner: 10Hashar) [15:34:06] mobrovac: I'm wondering why only the summary rule is processed like crazy right now, and mobile-apps for example is not.. [15:35:13] (03CR) 10JanZerebecki: "Yes as was said one could parametrize the contacts in the class definition with a parameter, use a parameter in a defined type, in hiera p" [puppet] - 10https://gerrit.wikimedia.org/r/292568 (https://phabricator.wikimedia.org/T135145) (owner: 10Elukey) [15:35:49] PROBLEM - zuul_gearman_service on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused [15:36:08] PROBLEM - zuul_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [15:36:33] contint1001 alarms can be acknowledged [15:36:47] zuul there is not ready yet / masked in systemd [15:38:49] <_joe_> hashar: I'll ack them in a few [15:40:13] (03PS1) 10Ottomata: Set max conns to 1024 for all zookeepers [puppet] - 10https://gerrit.wikimedia.org/r/293326 [15:40:55] (03CR) 10Ottomata: [C: 032 V: 032] Set max conns to 1024 for all zookeepers [puppet] - 10https://gerrit.wikimedia.org/r/293326 (owner: 10Ottomata) 
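[editor's note] The "Set max conns to 1024 for all zookeepers" change merged above boils down to one zoo.cfg setting (values from the log; the rest of the file is omitted). Note that maxClientCnxns is a per-client-IP limit, which is why a single scb host full of silently reconnecting kafka-node clients tripped it at 200:

```ini
# /etc/zookeeper/conf/zoo.cfg (fragment)
# Maximum concurrent connections a single client IP may hold.
# 0 means unlimited; the outage mitigation raised this 200 -> 1024.
maxClientCnxns=1024
```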
[15:42:57] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2364726 (10Papaul) @jcrespo I am sending the log file to the Dell support engineer, I will update you on the status. [15:43:54] !log restarting zk in codfw and eqiad 1 by 1 to apply maxClientCnxns=1024 [15:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:45:01] mobrovac: are you ready for ^^? [15:45:10] i just restarted one zk in codfw [15:45:17] kk, stopping cop [15:45:19] cp [15:45:21] k [15:45:27] stop in both codfw and eqiad if you can [15:45:46] kk [15:45:49] lemme know when i can proceed [15:46:07] ottomata: good to go for codfw [15:46:36] (03PS1) 10Elukey: Force varnishkafka (compatible with Varnish 4) to output the Resp timestamp. [puppet] - 10https://gerrit.wikimedia.org/r/293327 (https://phabricator.wikimedia.org/T136314) [15:46:40] ottomata: ok for eqiad too [15:46:57] 06Operations, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2364753 (10jcrespo) It seems as if the RAID operations were successful, but it got stuck on boot... [15:47:59] mobrovac: i guess codfw is good to go, but i think i see cp running there [15:48:18] hm, maye not [15:48:19] not sure [15:48:24] damn puppet probably [15:48:34] ottomata: you restarted in codfw? [15:48:38] ja [15:48:46] so i'll (re)start cpo there then [15:48:54] (03PS2) 10Elukey: Force varnishkafka (compatible with Varnish 4) to output the Resp timestamp. [puppet] - 10https://gerrit.wikimedia.org/r/293327 (https://phabricator.wikimedia.org/T136314) [15:48:55] mobrovac: in eqiad too [15:49:09] ACKNOWLEDGEMENT - jenkins_zmq_publisher on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 8888: Connection refused Giuseppe Lavagetto Work in progress installation. 
[15:49:09] ACKNOWLEDGEMENT - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 1 failures Giuseppe Lavagetto Work in progress installation. [15:49:09] ACKNOWLEDGEMENT - zuul_gearman_service on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused Giuseppe Lavagetto Work in progress installation. [15:49:10] ACKNOWLEDGEMENT - zuul_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server Giuseppe Lavagetto Work in progress installation. [15:49:23] kk ottomata [15:49:38] PROBLEM - changeprop endpoints health on scb2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.48.43, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:50:07] <_joe_> uh what's that? ^^ [15:50:42] (03CR) 10Ottomata: [C: 031] Force varnishkafka (compatible with Varnish 4) to output the Resp timestamp. 
[puppet] - 10https://gerrit.wikimedia.org/r/293327 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [15:50:47] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:50:52] 06Operations, 10ops-codfw, 06Labs: labtestneutron2001.codfw.wmnet does not appear to be reachable - https://phabricator.wikimedia.org/T132302#2364756 (10Papaul) a:05Papaul>03Andrew [15:50:57] _joe_: all good, cp was stopped there for zk restart [15:51:11] ottomata: starting on scb1001 [15:51:15] k [15:51:18] watching [15:51:38] RECOVERY - changeprop endpoints health on scb2002 is OK: All endpoints are healthy [15:52:47] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [15:53:56] mobrovac: so far just the 'normal' NodeExists stuff [15:54:09] yup ottomata, looking good on this side too [15:54:32] seems to have calmed down [15:54:38] ottomata: ok, start-up sequence completed, all workers are up! [15:54:55] but still lots of CONNECTION_LOSS in logstash [15:55:08] Pchelolo: that's scb200x [15:55:16] ah no scb1001 [15:55:18] (03CR) 10Giuseppe Lavagetto: [C: 031] DNS: Add prod DNS for mw2215-mw2238 Bug:T135466 [dns] - 10https://gerrit.wikimedia.org/r/292307 (https://phabricator.wikimedia.org/T135466) (owner: 10Papaul) [15:55:31] ya zk logs are more active there with those expired sessions [15:55:59] ottomata: what zk version is in prod? [15:56:27] 06Operations, 10Phabricator, 06Project-Admins, 06Triagers: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#2364784 (10mmodell) >>! In T706#2364449, @Danny_B wrote: > @mmodell Re the topic above: Can actually creating of milestones be separate r... 
[15:56:52] Pchelolo: 3.4.5+dfsg-2 [15:57:09] ottomata: session issue might be related to https://issues.apache.org/jira/browse/ZOOKEEPER-1382 [15:57:33] at least they write about similar log patterns there [15:58:15] (03CR) 10Eevans: "> Yes as was said one could parametrize the contacts in the" [puppet] - 10https://gerrit.wikimedia.org/r/292568 (https://phabricator.wikimedia.org/T135145) (owner: 10Elukey) [15:58:38] ok, so, i'm not completely certain that cp caused the outage alone [15:58:48] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:58:59] (03PS3) 10BBlack: Support optional keepalives and websockets for v4 only [puppet] - 10https://gerrit.wikimedia.org/r/287941 (https://phabricator.wikimedia.org/T134870) [15:59:01] (03PS3) 10BBlack: tlsproxy: turn proxy_request_buffering off for v4 [puppet] - 10https://gerrit.wikimedia.org/r/287996 [15:59:10] hm, actually, it's plausible [15:59:47] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 625 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5576970 keys - replication_delay is 625 [16:00:07] Pchelolo: ja maybe so! [16:00:14] ha, and fixed by neha 4 years ago :p [16:00:33] mobrovac: it didn't have that high rates of event processing and didn't use too much memory [16:00:41] yes [16:00:43] how could it kill all the services on scb? [16:00:48] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 5.685 second response time [16:01:01] what will happen if all of zk goes down for a sec? [16:01:13] its possible that happened for a few secs when we applied the first puppet change [16:01:34] it's most likely what happened [16:01:44] ottomata: from reading the code - cp should reconnect [16:01:49] is it possible the kafka-node client could cause crazy cpu usage and starve other stuff if so? 
[16:02:54] !log temporary set a 10TB upperbound to the Kafka webrequest_text topic to free space (T136690) [16:02:55] T136690: Kafka 0.9's partitions rebalance causes data log mtime reset messing up with time based log retention - https://phabricator.wikimedia.org/T136690 [16:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:03:03] PROBLEM - Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/redis - 256 bytes in 12.006 second response time [16:03:53] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5528812 keys - replication_delay is 0 [16:04:53] RECOVERY - Redis set/get on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 29.643 second response time [16:07:00] !log scb1002 enabling back puppet [16:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:08:04] 06Operations, 06Performance-Team, 06Services, 07Availability: Create restbase BagOStuff subclass - https://phabricator.wikimedia.org/T137272#2364843 (10GWicke) [16:08:27] 06Operations, 06Performance-Team, 06Services, 07Availability: Create restbase BagOStuff subclass (session storage) - https://phabricator.wikimedia.org/T137272#2363223 (10GWicke) [16:08:44] RECOVERY - zuul_merger_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [16:08:54] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 2 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [16:09:40] ottomata: can you see any zk logs for scb1002? 
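The 10TB upper bound logged at 16:02:54 maps to a per-topic retention.bytes override. A dry-run sketch follows: the echo keeps it side-effect free and runnable anywhere, and the kafka-configs.sh invocation and the conf1001 ZooKeeper address are assumptions about the cluster's tooling, not commands taken from the log.

```shell
#!/bin/sh
# Dry-run sketch of capping the webrequest_text topic at 10TB on disk.
# Tool name, flags, and ZK address are assumptions; drop the echo to run for real.
tb=10
retention_bytes=$((tb * 1024 * 1024 * 1024 * 1024))
echo kafka-configs.sh --zookeeper conf1001.eqiad.wmnet:2181 --alter \
     --entity-type topics --entity-name webrequest_text \
     --add-config "retention.bytes=$retention_bytes"
```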
[16:10:13] ip 10.64.16.21 [16:10:44] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [16:10:47] oh really [16:11:52] 06Operations, 06Performance-Team, 06Services, 07Availability: Create restbase BagOStuff subclass (session storage) - https://phabricator.wikimedia.org/T137272#2364852 (10Smalyshev) a:05aaron>03Smalyshev [16:12:21] mobrovac: did you restart cp again? i see logs for .0.16, scb1001 [16:12:32] also lots of NodeExists again [16:12:35] no ottomata, i started it on scb1002 [16:12:41] oh yes, i do see it [16:12:44] 16.21s [16:12:47] ok, makes sense [16:12:47] yes, that's normal, scb1002 is still starting up [16:12:58] k [16:13:01] 06Operations, 06Performance-Team, 06Services, 07Availability: Consider cassandra for session storage (with SSL) - https://phabricator.wikimedia.org/T134811#2364861 (10Smalyshev) Regarding the drivers - php-driver is a PHP extension, so we'd need to port it to HHVM if we want to use it in production. I'll c... [16:13:16] (03PS1) 10Muehlenhoff: Provide the firejail containment for imagemagick's convert(1) on all app servers [puppet] - 10https://gerrit.wikimedia.org/r/293328 (https://phabricator.wikimedia.org/T135111) [16:16:53] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:18:45] PROBLEM - salt-minion processes on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:18:48] (03PS2) 10Muehlenhoff: Provide the firejail containment for imagemagick's convert(1) on all app servers [puppet] - 10https://gerrit.wikimedia.org/r/293328 (https://phabricator.wikimedia.org/T135111) [16:18:53] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:19:13] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:19:26] my ssh session froze on scb1002 [16:19:37] ottomata: can you try logging in? 
[16:19:43] PROBLEM - Check size of conntrack table on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:19:49] ottomata: can you depool scb1002? [16:19:53] PROBLEM - puppet last run on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:19:54] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:19:55] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:20:03] PROBLEM - MD RAID on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:20:15] PROBLEM - mathoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:20:33] PROBLEM - SSH on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:21:30] ok, this is really strange now [16:21:34] PROBLEM - dhclient process on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:22:14] PROBLEM - MegaRAID on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:22:28] Failed to stop changeprop.service: Connection timed out [16:22:34] that's a first one from systemd [16:22:52] 06Operations, 06Labs, 10Labs-Infrastructure, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium): Firewall rules for labs support host to communicate with contint1001.eqiad.wmnet (new gallium) - https://phabricator.wikimedia.org/T137323#2364905 (10hashar) [16:22:54] PROBLEM - ores on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:09] can someone depool scb1002 please [16:23:13] PROBLEM - ores uWSGI web app on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:23:23] PROBLEM - configured eth on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:23:24] PROBLEM - DPKG on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:23:54] 06Operations, 06Labs, 10Labs-Infrastructure, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium): Firewall rules for labs support host to communicate with contint1001.eqiad.wmnet (new gallium) - https://phabricator.wikimedia.org/T137323#2364922 (10hashar) We might need... [16:24:24] !log restarting hadoop-yarn-resourcemanager on analytics1002 to make analytics1001 active [16:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:24:38] elukey ottomata disk swap in about 15 mins? [16:24:44] RECOVERY - ores on scb1002 is OK: HTTP OK: HTTP/1.0 200 OK - 2801 bytes in 2.760 second response time [16:24:44] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:02] elukey: ^^ can you take care of that? [16:25:10] disk swap? [16:25:31] ottomata: can you depool scb1002? [16:26:06] oh boy how to depool, wikitech searching... [16:26:31] akosiaris: around? [16:27:34] PROBLEM - Disk space on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:27:53] uh oh mobrovac i can't log into 1002 [16:28:13] oh sorry mobrovac was in standup reading backlog [16:28:18] i know, i told you that 10 mins ago [16:28:21] kk [16:28:27] let's fix this [16:28:37] k i'm looking for how to depool, not sure how [16:28:43] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.016 second response time [16:28:46] is this etcd managed? [16:28:56] you should remove each of the services from lvs [16:29:00] ottomata: afaik, yes [16:29:09] oh which are the services? [16:29:11] aye i can't do all at once [16:29:11] hm [16:29:46] found it [16:29:56] i think i can do all at once... [16:30:01] graphoid, citoid, cxserver, mobileapps, ores, mathoid [16:30:20] hm maybe [16:30:55] PROBLEM - ores on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:31:04] trying to log into palladium.. 
[16:31:13] ther eit goes [16:31:25] running confctl --tags dc=eqiad,cluster=scb --action set/pooled=no scb1002.eqiad.wmnet [16:31:45] ah nope, have to do each service [16:32:12] 06Operations, 10Deployment-Systems, 13Patch-For-Review, 03Scap3: Warning: rename(): Permission denied in /srv/mediawiki/wmf-config/CommonSettings.php on line 189 - https://phabricator.wikimedia.org/T136258#2328753 (10thcipriani) All patches for this task have merged. Anything left to do? [16:32:13] !log otto@palladium conftool action : set/pooled=no; selector: scb1002.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=graphoid']) [16:32:15] !log otto@palladium conftool action : set/pooled=no; selector: scb1002.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=citoid']) [16:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:32:17] !log otto@palladium conftool action : set/pooled=no; selector: scb1002.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=cxserver']) [16:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:32:18] !log otto@palladium conftool action : set/pooled=no; selector: scb1002.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mobileapps']) [16:32:20] !log otto@palladium conftool action : set/pooled=no; selector: scb1002.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=ores']) [16:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:32:24] !log otto@palladium conftool action : set/pooled=no; selector: scb1002.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mathoid']) [16:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:32:24] oh look at that, it logs for me! 
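As the log shows, confctl only depools one service per invocation ("ah nope, have to do each service"), hence the six separate !log entries. The sequence amounts to the loop below; the echo makes it a dry run that is runnable off the conftool master, and the service list and tags are the ones used above.

```shell
#!/bin/sh
# Dry-run of the per-service depool sequence for scb1002; remove the echo
# to execute for real on the conftool master.
host=scb1002.eqiad.wmnet
for svc in graphoid citoid cxserver mobileapps ores mathoid; do
  echo confctl --tags "dc=eqiad,cluster=scb,service=$svc" \
       --action set/pooled=no "$host"
done
```

Repooling afterwards is the same loop with set/pooled=yes.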
[16:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:33:00] mobrovac: i think that wokred [16:33:41] cool [16:33:45] that's progress :) [16:34:12] still no luck with logging into scb1002 [16:35:58] mobrovac: no response on serial console [16:36:01] powercycle? [16:36:29] !log Disabled puppet on contint1001 to prevent it from bringing back Jenkins [16:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:36:40] ottomata: yeah, go ahead, the services are depooled [16:37:35] !log powercycling scb1002 [16:40:27] cmjohnson1: i can do disk swap with ya [16:40:37] (03PS1) 10Gehel: Maps tables should all be owned by osmimporter [puppet] - 10https://gerrit.wikimedia.org/r/293331 (https://phabricator.wikimedia.org/T134901) [16:40:45] ottomata: great...give me 1 min to get back to the cage [16:40:52] k [16:41:27] RECOVERY - configured eth on scb1002 is OK: OK - interfaces up [16:41:27] RECOVERY - MD RAID on scb1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [16:41:34] mobrovac: ^ [16:41:37] RECOVERY - DPKG on scb1002 is OK: All packages OK [16:41:38] RECOVERY - MegaRAID on scb1002 is OK: OK: no disks configured for RAID [16:41:57] RECOVERY - dhclient process on scb1002 is OK: PROCS OK: 0 processes with command name dhclient [16:42:04] kk thnx ottomata, stopped changeprop [16:42:07] RECOVERY - Check size of conntrack table on scb1002 is OK: OK: nf_conntrack is 0 % full [16:42:07] RECOVERY - SSH on scb1002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0) [16:42:08] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 34 minutes ago with 0 failures [16:42:09] RECOVERY - Disk space on scb1002 is OK: DISK OK [16:42:29] RECOVERY - cxserver endpoints health on scb1002 is OK: All endpoints are healthy [16:42:37] RECOVERY - salt-minion processes on 
scb1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:42:57] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy [16:42:58] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [16:43:08] RECOVERY - mathoid endpoints health on scb1002 is OK: All endpoints are healthy [16:43:12] 06Operations, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2364976 (10hashar) The RAID array is rebuilding on gallium, would take ~1 hour and half. Puppet... [16:43:37] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [16:43:48] RECOVERY - ores on scb1002 is OK: HTTP OK: HTTP/1.0 200 OK - 2801 bytes in 0.028 second response time [16:44:04] ottomata: will you be around for the next hour or so? [16:44:06] ottomata: kafka1012 first [16:44:13] (03CR) 10Ladsgroup: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/292516 (owner: 10Ladsgroup) [16:44:17] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [16:44:39] no on [16:44:57] RECOVERY - ores uWSGI web app on scb1002 is OK: ● uwsgi-ores.service - uwsgi-ores uwsgi app [16:45:03] mobrovac: i would like to take a nap...i got up at 3:15 am this morn...was gonna quit early :) [16:45:14] cmjohnson1: ja [16:45:16] am ready [16:45:25] cmjohnson1: if you are there, i'll stop puppet and kafka broker [16:45:27] ottomata: i've been here for almost 12h myself [16:45:28] actually [16:45:31] do we know which disk it is [16:45:32] hm [16:45:36] ottomata: i'm here [16:45:39] sdf [16:45:48] (03CR) 10BBlack: [C: 04-1] Force varnishkafka (compatible with Varnish 4) to output the Resp timestamp. 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/293327 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [16:45:50] ahhh cmjohnson1 [16:46:01] gimme one moment sorry, if i copy the data elsewhere, it will ease recovery [16:46:09] cool. take your time [16:46:36] !log stopping kafka broker and puppet on kafka1012 to replace sdf [16:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:47:03] (03PS1) 10Muehlenhoff: Update to 4.4.13 [debs/linux44] - 10https://gerrit.wikimedia.org/r/293333 [16:50:08] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [16:50:17] (03PS1) 10Mobrovac: Change Prop: Fix the number of workers to 8 temporarily [puppet] - 10https://gerrit.wikimedia.org/r/293334 [16:50:32] ottomata: mind reviewing / merging ^ ? [16:50:42] !log cloning /var/lib/jenkins from db1085 to contint1001 [16:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:51:04] ottomata: are you copying' data from /dev/sdf right now? [16:51:07] 06Operations, 06Performance-Team, 06Services, 07Availability: Consider cassandra for session storage (with SSL) - https://phabricator.wikimedia.org/T134811#2365026 (10aaron) If we go with restbase, then the subclass will just use MultiHttpClient. It could either talk to a local http restbase endpoint (whic... [16:51:09] yes [16:51:13] cmjohnson1: from sdf to sdc [16:51:15] to sdc [16:51:20] cool..i know which one it is then [16:51:24] cool [16:51:52] (03CR) 10Ottomata: [C: 032 V: 032] Change Prop: Fix the number of workers to 8 temporarily [puppet] - 10https://gerrit.wikimedia.org/r/293334 (owner: 10Mobrovac) [16:52:13] mobrovac: merged [16:52:19] ottomata: thnx a bunch! 
[16:52:26] PROBLEM - Kafka Broker Server on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties [16:54:45] ottomata: is ^ something you can deal with? [16:54:45] oh i shoulda acked that [16:54:46] sorry yall [16:54:49] ok [16:54:56] andrewbogott: expected [16:54:58] we are swapping a disk [16:54:59] sorry [16:55:06] (a lot going on today...) [16:56:08] hm, cmjohnson1 kinda slow. [16:56:15] only 32G out of 500 or so... [16:56:38] oh..okay..how much time do you think? [16:56:48] I can come back to it [16:57:09] wanna go for an1049 [16:57:39] cmjohnson1: guessing another half an hour at least [16:57:45] cmjohnson1: checking [16:58:40] ottomata before we do that I need some log output showing failure that is still under warranty [16:59:13] (03PS2) 10Muehlenhoff: Update to 4.4.13 [debs/linux44] - 10https://gerrit.wikimedia.org/r/293333 [16:59:22] cmjohnson1: ok [16:59:23] uh [16:59:29] /dev/sdc: read failed after 0 of 4096 at 0: Input/output error [16:59:32] what do you need? [16:59:33] :) [16:59:49] usually syslog will have lines of disk errors [17:00:05] also possibly the ilom log, i'll check the ilom log for ya ottomata [17:00:21] robh, analytics1049, sdc [17:01:14] so ilom log doesnt have disk error info for this, which makes sense but always good to check [17:01:21] 06Operations, 10Traffic, 06WMF-Legal, 05Security: policy.wikimedia.org SSL vulnerability - https://phabricator.wikimedia.org/T137258#2365049 (10Slaporte) WordPress tells me that they have resolved the issue. It [looks fixed](https://www.ssllabs.com/ssltest/analyze.html?d=policy.wikimedia.org) to me, but th... [17:01:34] seems this is hw raid? [17:01:42] https://gist.github.com/ottomata/d45314fba9f6f2c7ce0161d8c26abd84 [17:01:42] ? 
[17:02:03] yeah, that is good to have [17:02:35] ja [17:02:37] also [17:02:38] Device Present [17:02:38] ================ [17:02:38] Virtual Drives : 13 [17:02:38] Degraded : 0 [17:02:38] Offline : 1 [17:02:38] Physical Devices : 15 [17:02:38] Disks : 14 [17:02:39] Critical Disks : 0 [17:02:39] Failed Disks : 1 [17:02:56] so the hw controller has a log as well [17:03:15] updated gist [17:03:18] https://gist.github.com/ottomata/d45314fba9f6f2c7ce0161d8c26abd84 [17:03:44] ottomata: run -AdpEventLog -GetEvents -f events.log -aALL && cat events.log [17:03:50] the output of that has the best info for failure conditions [17:03:56] it has all the disk events for the controller [17:04:03] sorry, run megalcli -AdpEventLog -GetEvents -f events.log -aALL && cat events.log [17:04:11] i left out the important bit ;] [17:04:31] and i can see lots of failures for one of the disks in that log [17:04:35] robh its long ja [17:04:39] in my home but i guess you did it too [17:04:56] well, you should paste it all onto the task for cmjohnson1 to reference to dell [17:05:00] 06Operations, 10Traffic, 06WMF-Legal, 05Security: policy.wikimedia.org SSL vulnerability - https://phabricator.wikimedia.org/T137258#2362614 (10MoritzMuehlenhoff) Seems fine now [17:05:02] i didnt bother to copy down anything, figured you were ;] [17:05:10] but i can if you arent sure [17:05:10] k, uhh, cmjohnson1 did elukey make a task? checking... 
[17:05:13] if not i will make one [17:05:26] well, any disk failure has to have a task for him to work on it ;] [17:05:31] why yes he did [17:05:35] https://phabricator.wikimedia.org/T137273 [17:05:44] yep [17:05:48] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 56.67% of data above the critical threshold [10.0] [17:06:07] so yeah, i'd just paste in the megacli log of alarms as well [17:06:13] robh i can't paste this whole file [17:06:19] its 9.6M [17:06:25] 06Operations, 10ops-eqiad, 10Analytics-Cluster: analytics1049.eqiad.wmnet disk failure - https://phabricator.wikimedia.org/T137273#2363239 (10Ottomata) Also: ``` Jun 8 16:58:34 analytics1049 kernel: [7283582.453037] sd 0:2:2:0: [sdc] Jun 8 16:58:34 analytics1049 kernel: [7283582.453043] Result: hostbyte=D... [17:06:29] pasted in the stuff from syslog [17:06:37] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 58.62% of data above the critical threshold [10.0] [17:06:46] ah, i will schedule downtime for those too [17:06:48] nah, i'd only include the last day or two, not the entire file [17:06:57] of the megacli log [17:06:57] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [10.0] [17:07:06] basically showing enough log to show the disk in full error mode [17:07:17] its repeating the same thing over and over for days in there [17:07:18] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [10.0] [17:07:28] so even part of the last day is likely enough [17:07:33] ottomata: I am in a meeting but let me know if you need help [17:07:56] 06Operations, 10Traffic, 06WMF-Legal, 05Security: policy.wikimedia.org SSL vulnerability - https://phabricator.wikimedia.org/T137258#2365073 (10BBlack) 05Open>03Resolved Yup looks great, thanks! 
[17:08:27] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 62.07% of data above the critical threshold [10.0] [17:08:34] 06Operations, 10Traffic, 06WMF-Legal, 05Security: policy.wikimedia.org SSL vulnerability - https://phabricator.wikimedia.org/T137258#2365078 (10Slaporte) Excellent, thank you! [17:08:40] 06Operations, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2365079 (10jcrespo) More details before I go: there are several backups on `db1085:/srv/backup/... [17:09:07] 06Operations, 10ops-eqiad, 10Analytics-Cluster: analytics1049.eqiad.wmnet disk failure - https://phabricator.wikimedia.org/T137273#2365080 (10Ottomata) Some output from ``` sudo megacli -AdpEventLog -GetEvents -f events.log -aALL ``` ``` ... seqNum: 0x00005441 Time: Wed Jun 8 06:55:17 2016 Code: 0x0000... [17:10:39] cmjohnson1: you can swap sdc on analytics1049 at any time [17:11:02] 06Operations, 10ops-eqiad, 10Analytics-Cluster: analytics1049.eqiad.wmnet disk failure - https://phabricator.wikimedia.org/T137273#2363239 (10Cmjohnson) Enclosure Device ID: 32 Slot Number: 1 Drive's position: DiskGroup: 2, Span: 0, Arm: 0 Enclosure position: 1 Device Id: 1 WWN: 500003964ba801e7 Sequence Num... [17:11:42] ottomata: remember https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration#Swapping_broken_disk (if you want to follow it) [17:12:03] ottomata: correction, just the last two entries of the megacli log is enough [17:12:08] since it seems to repeat it over an dover [17:12:09] =] [17:13:04] elukey: thanks I actually had just pulled that up [17:13:08] thanks for writing that up! 
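robh's advice — the full events.log is 9.6M but the last two entries are enough since the same failure repeats — can be scripted. megacli event records begin with a "seqNum:" line, so the tail can be cut record-wise rather than line-wise. A sketch under assumptions: the three-record sample file is fabricated for the demo, and a GNU userland (tac, awk) is assumed.

```shell
#!/bin/sh
# Extract the last two "seqNum:"-delimited records from a megacli event log.
# The sample events.log written here is fabricated purely for illustration.
printf 'seqNum: 0x1\nCode: a\n\nseqNum: 0x2\nCode: b\n\nseqNum: 0x3\nCode: c\n' > events.log
# Reverse the file, keep everything up to the second record header, reverse back.
last_two=$(tac events.log | awk '/^seqNum:/ {n++} {print} n == 2 && /^seqNum:/ {exit}' | tac)
printf '%s\n' "$last_two"
```

On the real file this yields a short excerpt suitable for pasting into the phabricator task instead of the whole log.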
[17:13:17] ottomata: can you write to sdc for me plz [17:13:28] cmjohnson1: don't think i can, its a little busted [17:13:31] i can try to mount it [17:13:43] megacli shows disk 1 as being failed which would not fit where I would think sdc would be located [17:14:00] nope mount: /dev/sdc1: can't read superblock [17:14:15] cmjohnson1: [17:14:19] k [17:14:20] sda and sdb are internal 2.5 drives [17:14:26] probably not part of the hw raid controller [17:14:36] ah that could be it [17:15:22] still doesn't fit....but I am going to pull slot 1 since it's failed [17:15:57] doesn't fit? ok. [17:16:14] disk swapped [17:16:37] cmjohnson1: sometimes I wish you had a live stream + go pro strapped to your head while you are in there [17:17:15] haha..yeah....what I meant was slot 0 should dev/sdc and slot1 would be /dev/sdd if it were a perfect world [17:17:28] oh right [17:18:23] cmjohnson1: trying to list status with megacli, but it is hanging [17:18:35] yep..having the same issue [17:20:03] 06Operations, 06Labs, 10Labs-Infrastructure, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium): Firewall rules for labs support host to communicate with contint1001.eqiad.wmnet (new gallium) - https://phabricator.wikimedia.org/T137323#2364905 (10mark) If I read this... [17:20:57] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [17:21:37] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update to 4.4.13 [debs/linux44] - 10https://gerrit.wikimedia.org/r/293333 (owner: 10Muehlenhoff) [17:22:57] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [17:24:33] cmjohnson1: any luck? 
[17:24:37] no [17:25:16] ottomata: sorry yeah it's taking awhile cuz that disk is new [17:25:20] ok [17:29:09] (03PS1) 10BBlack: ssl sid cache: 25h + 4G [puppet] - 10https://gerrit.wikimedia.org/r/293340 [17:30:20] (03CR) 1020after4: [C: 031] "This is really coming together." [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [17:31:43] ottomata: there is preserved cache but will not discard without forcing...i will rather you do that [17:31:47] (03CR) 10BBlack: [C: 032 V: 032] ssl sid cache: 25h + 4G [puppet] - 10https://gerrit.wikimedia.org/r/293340 (owner: 10BBlack) [17:32:11] disk cache in mem? cmjohnson1? [17:32:22] just for what was on sdc? [17:32:25] yes [17:32:28] root@analytics1049:~# megacli -DiscardPreservedCache -L2 -a0 [17:32:28] Adapter #0 [17:32:28] One or more virtual drives are Offline. In order to discard preserved cache use -force option. This will discard the preserved cache & delete the offline virtual drives. [17:33:38] cmjohnson1: do it, if it was just sdc, i mean, it won't hurt to delete in mem disk cache for others i guess anyway [17:33:42] nothing is running right now [17:36:18] okay [17:44:55] cmjohnson1: s'ok? [17:45:01] ottomata: strange enough...the new disk is still reporting failed. [17:45:05] hm [17:45:07] just replaced it with a different one [17:45:13] ok [17:45:15] and still? [17:45:18] same thing [17:45:20] huh [17:45:25] so maybe controller problem? [17:45:26] wonder if it's not really slot 1 [17:45:32] oh [17:45:41] i mean slot shows failed on controller [17:45:50] but IDK ...it's weird [17:46:30] can help in any way? [17:46:54] I* [17:49:27] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.007 second response time [17:56:02] cmjohnson1: ? 
[17:56:18] working on it [17:56:49] k [17:57:18] RECOVERY - MegaRAID on analytics1049 is OK: OK: optimal, 12 logical, 13 physical [17:57:26] .\o/ [18:00:04] yurik gehel thcipriani: Respected human, time to deploy Scap3 Service Migration (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160608T1800). Please do the needful. [18:00:04] yurik: A patch you scheduled for Scap3 Service Migration is about to be deployed. Please be available during the process. [18:00:13] * gehel o/ [18:00:17] \o/ [18:00:27] also present. [18:00:30] * yurik runs away [18:00:33] * gehel is ready to learn scap3... [18:00:53] we don't need yurik anyway! :P [18:01:14] * gehel is joking, yurik knows more about that scap3 thing than I do [18:01:19] so thcipriani how do we start? [18:01:55] so the process for this in the past has been: merge puppet, run puppet on tin, run a deploy from tin that fails, run puppet on targets, the run a deploy from tin that succeeds (hopefully) [18:02:21] I'm not sure if the scap.cfg patches have merged into the repos just yet either, so that'll need to happen before anything else. [18:02:54] thcipriani, ok, so lets migrate kartotherian first - smaller and easier to test. [18:02:55] yurik: were you the one preparing those patches? [18:02:55] https://gerrit.wikimedia.org/r/#/c/285979/ and https://gerrit.wikimedia.org/r/#/c/285980/ [18:03:31] kk, so starting with kartotherian, this one should merge and then pull down to the /srv/deployment/kartotherian/deploy on tin: https://gerrit.wikimedia.org/r/#/c/285980/ [18:03:35] thcipriani: thanks! give me a minute to read them before merging... [18:04:20] and this is the corresponding puppet patch: https://gerrit.wikimedia.org/r/#/c/291930/ [18:04:23] thcipriani, gehel, but its only for the old cluster? gehel is building the new one at the moment [18:04:26] ottomata: i was able to fail out the old disk...even after removing it the controller still reported it being there. 
I added a new disk but the controller is hanging [18:05:05] yurik: as far as I can see, it is only the old cluster defined in "targets" [18:05:15] (03PS3) 10Papaul: DNS: Add prod DNS for mw2215-mw2250 and removed the old mw entries mw2001-mw2016/mw2018-mw2060 Bug:T135466 [dns] - 10https://gerrit.wikimedia.org/r/292307 (https://phabricator.wikimedia.org/T135466) [18:05:18] h ok [18:05:25] gehel, are the puppets running on the new one? should we deploy it to both at the same time? [18:05:29] thcipriani: I expect that I can add the new maps hosts after the fact and re-run a deploy ? [18:05:52] thcipriani, context: we have maps-test* cluster, and currently building the maps* cluster [18:05:53] (03CR) 10Elukey: Force varnishkafka (compatible with Varnish 4) to output the Resp timestamp. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/293327 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [18:05:59] yurik: puppet not running, but I can re-enable, no issue [18:05:59] gehel: sure. you just edit the /srv/deployment/kartotherian/deploy/scap/targets file [18:06:17] ok, lets try old cluster first, and then add more and re-depl [18:06:28] this way we will be certain adding new clusters works ok [18:06:48] (03CR) 10Yurik: [C: 031] Scap3 config for Kartotherian [puppet] - 10https://gerrit.wikimedia.org/r/291930 (https://phabricator.wikimedia.org/T129150) (owner: 10Thcipriani) [18:06:51] +1 ^ [18:06:54] (03PS3) 10Elukey: Force varnishkafka (compatible with Varnish 4) to output the Resp timestamp. 
[puppet] - 10https://gerrit.wikimedia.org/r/293327 (https://phabricator.wikimedia.org/T136314) [18:07:28] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.184 second response time [18:07:39] thcipriani, i'm merging the https://gerrit.wikimedia.org/r/#/c/285980/1/scap/scap.cfg [18:07:49] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [18:08:07] ottomata: can't ping an1049 any longer [18:08:09] yurik: ack, looks good. [18:08:17] can i reboot it? [18:08:42] ja cmjohnson1 go ahead, hm, it might try to start hadoop stuff [18:08:43] on boot [18:08:53] i think it will fail, shouldn't hurt anything [18:09:18] yurik: oops, I just added a commit with the new servers [18:09:20] thcipriani, gehel, merged. I will rebuild the package now, so that we have the latest [18:09:58] gehel, should we remove and try one cluster at a time? [18:09:59] or do both? [18:10:41] yurik: let's do both, new maps servers don't have prod traffic anyway [18:10:47] ok [18:11:15] 06Operations, 06Labs, 10Labs-Infrastructure, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium): Firewall rules for labs support host to communicate with contint1001.eqiad.wmnet (new gallium) - https://phabricator.wikimedia.org/T137323#2365294 (10hashar) Sorry it is n... [18:13:37] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:13:43] still rebuilding kartotherian package...
takes a bit of time in a docker [18:13:52] * yurik hopes it will be ok [18:14:04] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2365303 (10mmodell) ``` RewriteRule "^/tree/(.+).git" https://phabricator.wikimedia.org/r/p/%1;browse/ [R=301,NE] ``` [18:14:15] (03CR) 10Elukey: "Tried with kafka configs --alter --entity-type topics --entity-name webrequest_text --add-config retention.bytes=10000000000000 but didn't" [puppet] - 10https://gerrit.wikimedia.org/r/293270 (https://phabricator.wikimedia.org/T136690) (owner: 10Elukey) [18:15:05] (03PS3) 10Elukey: filter out new metrics [puppet] - 10https://gerrit.wikimedia.org/r/290860 (owner: 10Eevans) [18:15:20] urandom: going to merge --^ [18:15:28] (03CR) 10BBlack: [C: 031] Force varnishkafka (compatible with Varnish 4) to output the Resp timestamp. [puppet] - 10https://gerrit.wikimedia.org/r/293327 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [18:16:26] cmjohnson1: everything ok? [18:16:39] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2365308 (10mmodell) RewriteRule "^/log/(.+).git/refs/heads/(.*)" https://phabricator.wikimedia.org/r/p/%1;history/%2/... [18:16:44] (03CR) 10Elukey: [C: 032] filter out new metrics [puppet] - 10https://gerrit.wikimedia.org/r/290860 (owner: 10Eevans) [18:17:09] (03CR) 10Elukey: [V: 032] "Jenkins is down atm but this change was already +2 and then only rebased later on." [puppet] - 10https://gerrit.wikimedia.org/r/290860 (owner: 10Eevans) [18:17:18] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 1.017 second response time [18:17:43] urandom: merged! 
[18:17:52] thcipriani, gehel, kartotherian depl package is ready and has been merged [18:17:57] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [18:18:18] thcipriani: so there is also a puppet patch to merge? [18:18:24] * gehel looking for it... [18:18:34] gehel: yup. https://gerrit.wikimedia.org/r/#/c/291930/ [18:18:46] thcipriani: thanks [18:19:01] yurik: you'll need to pull that new code onto tin pre-deployment [18:19:16] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2365323 (10mmodell) ``` RewriteRule "^/commit/(.+)\.git/(\w+)" https://phabricator.wikimedia.org/r/revision/%1;%2 ``` [18:19:35] thcipriani, tin:/srv/deployment/kartotherian/deploy ? [18:19:54] i just did git pull & git submodule update [18:19:57] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5528624 keys - replication_delay is 0 [18:20:20] yurik: cool, yup I see the new scap dir there :) [18:20:31] thcipriani, yurik: puppet change looks trivial, but I'm sure there is a lot of magic behind. Rebasing and merging... 
[18:20:39] !log restarting changeprop service on scb1001 [18:21:27] (03PS2) 10Gehel: Scap3 config for Kartotherian [puppet] - 10https://gerrit.wikimedia.org/r/291930 (https://phabricator.wikimedia.org/T129150) (owner: 10Thcipriani) [18:21:33] !log stopping changeprop service on scb1001 [18:21:59] ottomata: fixed [18:22:09] (03CR) 10Gehel: [C: 032 V: 032] "Jenkins still down, but change was only rebased since last check" [puppet] - 10https://gerrit.wikimedia.org/r/291930 (https://phabricator.wikimedia.org/T129150) (owner: 10Thcipriani) [18:22:21] all disks show and are online-spun up [18:22:36] oooo [18:22:47] !log switching maps to scap3 deployment [18:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:23:06] !log restarting changeprop service on scb1001 [18:23:08] ok, so we'll want to run puppet on tin first. [18:23:31] thcipriani, yurik: puppet is running on tin [18:23:54] thcipriani, is that something i need to do? [18:23:59] as part of the depl? [18:24:07] yurik: nope, just the first time [18:24:11] ^ [18:24:16] 06Operations, 10ops-eqiad, 10Analytics-Cluster: analytics1049.eqiad.wmnet disk failure - https://phabricator.wikimedia.org/T137273#2365341 (10Cmjohnson) Replaced the disk. The preserved cache was not able to be cleared using megacli commands. Had to reboot the server and discard using the raid bios on-site.... [18:24:30] cmjohnson1: how far along did you get? [18:24:30] thcipriani, can you sync my earlier patch in the meantime? :) [18:24:35] do I need to make a raid0 thing? [18:24:38] i would check that too [18:24:49] guess not, i see /dev/sdc there [18:24:49] there might be two more minor CSS patches :( [18:24:59] ottomata: all should be good now [18:25:03] yurik: you would not have the sudo powers to run puppet anyway :( [18:25:06] did you make a partition?
[18:25:08] i see sdc1 [18:25:17] (03PS4) 10Dzahn: git.wikimedia.org -> Diffusion redirects [puppet] - 10https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224) [18:25:30] yurik: soon we will have CI and then all will be merged :) [18:25:32] * yurik googles for root escalation bugs [18:25:58] 06Operations, 10ops-eqiad, 10Analytics-Cluster: analytics1049.eqiad.wmnet disk failure - https://phabricator.wikimedia.org/T137273#2365359 (10Cmjohnson) Leave the ticket open until I get the warranty part from Dell. [18:26:11] yurik: after puppet runs on tin, you'll run a 'scap deploy' which should fail (since puppet will have to run on the targets before it succeeds) [18:26:29] thcipriani: puppet run completed, only change I see is about Salt::Grain[deployment_server]/Exec[ensure_deployment_server_true] [18:26:35] cmjohnson1: ^^^^^ [18:26:37] yurik: actually, run: scap deploy --init [18:27:13] gehel: yeah, it should be mostly a noop, just changing the thing that is responsible for /srv/deployment/kartotherian/deploy from salt to puppet. [18:27:15] ohhh cmjohnson1 they all slid down one? [18:27:33] ok that's fine [18:27:36] i see sdm [18:27:41] yurik: ok, so run: scap deploy --init from /srv/deployment/kartotherian/deploy [18:27:41] yeah, i didn't mess w/partitions [18:27:43] thcipriani, i was hoping to work with the community today wrt styling. [18:27:46] ok, running [18:27:57] done [18:28:37] gehel: ok, go ahead and run puppet on the target machines please [18:28:51] yurik: that generated at /srv/deployment/kartotherian/deploy/.git/DEPLOY_HEAD file [18:29:10] which is the configuration for the upcoming deployment [18:29:28] PROBLEM - puppet last run on elastic1012 is CRITICAL: CRITICAL: Puppet has 1 failures [18:29:45] thcipriani, yurik: puppet running on maps* [18:31:10] cmjohnson1: about 30G to go over on kafka1012 [18:31:34] o [18:31:36] ok [18:32:06] thcipriani, yurik: puppet is done [18:33:39] should i do anything now? 
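For reference, the first-time migration flow walked through above (merge the puppet patch, puppet on tin, `scap deploy --init`, puppet on the targets, then the real deploy) boils down to roughly the following; paths are the kartotherian ones from the log, and the exact `puppet agent` invocation is the usual one rather than anything quoted here:

```shell
# Sketch of the first-time scap3 migration flow described above.

sudo puppet agent --test        # 1. on tin, after the puppet patch merges

cd /srv/deployment/kartotherian/deploy
scap deploy --init              # 2. writes .git/DEPLOY_HEAD, the config
                                #    for the upcoming deployment

                                # 3. an opsen runs puppet on each target so
                                #    scap::target hands /srv/deployment/...
                                #    over to the deploy-service user

scap deploy -v                  # 4. fetch code to the targets, restart the
                                #    configured service, check its port
```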
[18:33:44] yurik: ok, now for the deployment. This will fetch the current version of the code from tin, and restart and check the kartotherian service is running on port 6533. Go ahead and run: 'scap deploy -v' from /srv/deployment/kartotherian/deploy please [18:34:10] thcipriani, scary :) can we do it on one server only? [18:34:13] or not restart? [18:34:18] i can restart manually afterwards [18:34:29] PROBLEM - changeprop endpoints health on scb2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.48.43, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [18:35:46] yurik: sure. so to *not* restart just comment out: service_name and service_port in the scap/scap.cfg file. To run a canary deployment on only one server add that server name to target-canary and uncomment server_groups and canary_dsh_target in the scap.cfg file [18:36:28] RECOVERY - changeprop endpoints health on scb2002 is OK: All endpoints are healthy [18:36:36] if you run a canary deploy, a full deploy will run on the canary servers specified, and then it will prompt you to continue the deployment [18:36:58] you'll need to re-run scap deploy --init to regenerate the DEPLOY_HEAD file [18:37:01] thcipriani, so i should modify the scap.cfg locally without checking it in? [18:37:38] yurik: you can if you want to do this just for one deploy, or you can commit it to do this every time. [18:37:55] previous git deploy didn't allow me to mess with local files :) [18:37:59] sec [18:38:05] 06Operations, 06Labs, 10wikitech.wikimedia.org, 07LDAP: Changing username on LDAP - https://phabricator.wikimedia.org/T137315#2365391 (10demon) 05Open>03Resolved Renamed in LDAP and Wikitech. I don't see you in Gerrit so I didn't do anything there. [18:44:01] thcipriani, git diff please and see if i did it right [18:44:16] should i remove maps2001 from the targets?
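A rough sketch of the scap.cfg knobs thcipriani names above; only the key names mentioned in the log (service_name, service_port, server_groups, canary_dsh_target, and the targets/target-canary files) are taken from the conversation, while the surrounding layout and the dsh_targets line are illustrative assumptions, not the real kartotherian file:

```ini
; Hedged sketch of a scap.cfg canary setup (not the actual repo config).
[global]
dsh_targets: targets              ; assumed key; file listing all targets
; Comment these two out to skip the automatic restart + port check:
service_name: kartotherian
service_port: 6533
; Uncomment to deploy one group at a time, canary first:
; server_groups: canary,default
; canary_dsh_target: target-canary
```

With the canary group enabled, scap deploys fully to the canary host, prompts for confirmation, then proceeds to the remaining targets.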
[18:44:20] because now its in both [18:45:01] doesn't matter. it'll go with the earliest file it's in, shouldn't deploy twice. [18:45:19] yurik: lgtm, re-run: scap deploy --init to regenerate .git/DEPLOY_HEAD [18:45:21] thcipriani, i guess the best way is to deploy it to one server (e.g. 2001), restart it, test it automatically, let me manually test that its ok as well, and then i will say "yes" to continue [18:45:32] how would i set that up? [18:45:57] re-ran init [18:46:39] so if you uncomment the service_name and service_port, it'll do the service restart and check the port on all the machines including the canary. But you will be prompted after the canary deploy to continue so you can check manually. [18:47:08] however, service restart will then also happen on the non canary targets [18:47:14] thcipriani, right, but i don't want it to restart all machines, only the canary one [18:47:37] until it deploys to all :) [18:47:38] right now you could only set that up with a custom check [18:47:54] ok, manual restart it is :) [18:48:11] thcipriani, what's my next step now? [18:48:17] i did the --init [18:48:52] yup, I saw that (stalking via scap deploy-log :)) [18:49:13] and now "scap deploy -v" ? [18:49:16] ^ [18:49:53] oh, right... hang on to your hats! [18:50:48] PROBLEM - puppetmaster https on labcontrol1001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 503 Service Unavailable [18:50:48] PROBLEM - Puppet catalogue fetch on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 0.154 second response time [18:51:01] thcipriani, it said it restarted something? [18:51:22] robh: yt? [18:51:26] yurik: no, not since we commented out the service_name and service_port [18:51:26] got some megacli Qs [18:51:34] ottomata: sup? 
[18:51:43] kartotherian/deploy: promote and restart_service stage(s): 100% (ok: 1; fail: 0; left: 0) [18:51:43] ok [18:51:47] chris just replaced disk on kafka1012 [18:51:59] the other disks state is [18:52:06] Enclosure Device ID: 32 [18:52:06] Slot Number: 4 [18:52:06] Firmware state: JBOD [18:52:10] the replaced is [18:52:16] Enclosure Device ID: 32 [18:52:16] Slot Number: 5 [18:52:16] Firmware state: Unconfigured(good), Spun Up [18:52:20] i'm trying to make JBOD [18:52:21] with [18:52:25] megacli -PDMakeJBOD -PhysDrv[32:5] -a0 [18:52:31] but that gives Adapter: 0: Failed to change PD state at EnclId-32 SlotId-5. [18:52:34] yurik: but it should have swapped out the code, so now on that host: /srv/deployment/kartotherian/deploy should be a symlink to a dir inside /srv/deployment/kartotherian/deploy-cache/revs/ named after the commit: c798a573ba54e4ddcc81fc7d8f7a1dc27204d6fb [18:52:38] any ideas? [18:53:08] RECOVERY - Puppet catalogue fetch on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 24.643 second response time [18:53:09] ottomata: you may need to mark it as online first [18:53:13] * yurik checks [18:53:22] PDMakeGood [18:53:22] ? [18:53:23] -PDOnline -PhysDrv [E:S] -aN [18:53:25] ohh [18:53:25] k [18:53:40] not sure but if it doesnt show as online and ready that sounds likely [18:53:42] hm same [18:53:43] Adapter: 0: Failed to change PD state at EnclId-32 SlotId-5. [18:53:52] megacli -PDOnline -PhysDrv[32:5] -a0 [18:54:08] maybe Spun Up is the wrong state? [18:54:13] maybe it has to be offline before I can change state? 
[18:54:51] nope, -PDOffline does same thing [18:54:55] thcipriani, yep, all's good, continuing [18:55:05] kk [18:55:26] well, you can replace a missing disk with MegaCli -PdReplaceMissing -PhysDrv [E:S] -ArrayN -rowN -aN [18:55:27] RECOVERY - puppet last run on elastic1012 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [18:55:36] ottomata: lemme send you this cheat sheet real quick [18:55:38] ok [18:56:01] robh [18:56:02] -ArrayN -rowN -aN [18:56:02] thcipriani, so yeah, it seems ideally (at least for me), would be automatic deployment, restart, and test on canary-only, then prompt me to continue, then do the rest of the servers [18:56:12] is array the adapter #? [18:56:17] no that is -a [18:56:22] ok, don't know what array or row are in that [18:56:29] thcipriani, after continuing, it shouldn't deploy or restart the canary ones [18:56:44] -a is adapter, array is the array number from the output of the array list [18:56:58] starting with 0 [18:57:01] yurik: kk, so only service restart on the canary? All others you want to restart manually? [18:57:05] so if there is only one disk array, its 0 [18:57:12] ottomata: you should have the sheet in your email [18:57:30] thcipriani, no, i do want them to restart automatically, but only after i manually check that the service works on canary [18:57:38] ottomata: it has a rebuilding-an-array section [18:57:40] so automatic restart only on the servers that just deployed [18:57:43] which has a step by step on what to run [18:57:56] i refer to it all the damned time. [18:58:05] ottomata: since you are still here, can i bug you to repool scb1002 services? [18:58:20] thcipriani, 1) canary-only: deploy, restart, test, wait for manual testing. 2) all others - deploy/restart/test [18:58:45] robh and row? [18:58:48] mobrovac: gimme a sec [18:58:54] kk thnx [18:59:30] also 0 since jbod?
[18:59:45] !log switched kartotherian to scap3, deployed, restarted [18:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:59:55] mobrovac: you sure you ready? i am about to disappear [18:59:58] and pass out [19:00:07] thcipriani, ok, lets do tilerator - should be much easier now because there i don't care about canary [19:00:19] ottomata: pass out, ori offered to help out [19:00:23] yurik: yup, so if you uncomment the service_name and service_port it will do: full deploy to maps-test2001.codfw.wmnet, restart, check that the port is up, then wait for you to say 'y' to continue, then it will do it for all other servers. [19:00:34] what's not intuitive about "confctl select dc=eqiad,service=apache2,name=mw1018.eqiad.wmnet set/pooled=no "? [19:00:37] ottomata: not sure on row, the output for checking the virtual disk should tell that i think [19:00:54] thcipriani, i thought you said that it will restart all of them after deploying just to canary [19:01:10] ok [19:01:29] !log ori@palladium conftool action : set/pooled=yes; selector: name=scb1002.eqiad.wmnet [19:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:01:35] yurik: ahh, no. It will deploy and restart the service on just the canary, wait for you, then deploy the others and then restart the others [19:01:36] ottomata: if you cannot figure out the row lemme know and i'll hop on [19:01:38] mobrovac: ^ [19:01:38] RECOVERY - jenkins_zmq_publisher on gallium is OK: TCP OK - 0.000 second response time on port 8888 [19:01:43] just trying to avoid doing that right now, mid quote review ;] [19:01:51] robh i don't see anything about row (or array) in sudo megacli -PDList -aAll [19:01:56] thcipriani, ah, awesome, then its exactly what i want :D [19:02:02] yurik: :D [19:02:06] (am also looking for email...:) ) [19:02:07] ori: thnx! [19:02:13] thcipriani, will it re-restart the canary ones the second time?
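For reference, the repool mobrovac asks for is a conftool one-liner; both forms below are taken directly from the log (the mw1018 depool example quoted above, and ori's logged repool of scb1002):

```shell
# Depool a host (example quoted in the log):
confctl select dc=eqiad,service=apache2,name=mw1018.eqiad.wmnet set/pooled=no

# Repool scb1002, matching ori's logged conftool action:
confctl select name=scb1002.eqiad.wmnet set/pooled=yes
```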
[19:02:26] if the server is listed twice [19:02:32] in canary and in regular targets [19:03:06] yurik: no, it'll only restart the canary the one time even though it's listed in both. The earliest group that a server is in is when it gets restarted. [19:03:07] RECOVERY - puppetmaster https on labcontrol1001 is OK: HTTP OK: Status line output matched 400 - 333 bytes in 7.989 second response time [19:03:35] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2365441 (10Dzahn) here's another working one for the second type: https://gerrit.wikimedia.org/r/#/c/293221/4/module... [19:03:51] yurik: if you notice in the output under == CANARY == it lists maps-test2001.codfw.wmnet but under == DEFAULT == maps-test2001 is not there. [19:03:53] thcipriani, awesome, thx. next question :) is it possible to restart two services with scap3? [19:03:54] https://gerrit.wikimedia.org/r/#/c/285979/2/scap/scap.cfg [19:04:29] there it lists "tilerator", but i would also like to restart "tileratorui", testable on port 6535 [19:04:46] oh maybe i have to clear foreign state [19:04:49] Foreign State: Foreign [19:04:49] Foreign Secure: Drive is not secured by a foreign lock key [19:05:00] yurik: no multi-service restart without custom checks yet, but there's a ticket: https://phabricator.wikimedia.org/T130361 [19:05:11] doing that [19:05:26] thcipriani, awesome, thx. gehel, i will add maps* to tilerator scap config and merge [19:05:29] ah! 
robh ogt it [19:05:31] got it [19:05:34] had to clear foreign state [19:05:43] ohhh, yes [19:05:53] if the new disk was ever in another raid it would have that old raid data [19:06:06] so you just told the controller to clear that cruft out [19:06:10] yurik: multi-service restart is currently achievable with https://doc.wikimedia.org/mw-tools-scap/scap3/quickstart/setup.html#command-checks [19:06:49] yurik, thcipriani do you know where the other puppet patch to merge is? I should be able to do that right now. [19:07:02] yurik, thcipriani: dinner is calling... [19:07:10] gehel: https://gerrit.wikimedia.org/r/#/c/291268/ [19:07:57] 06Operations, 06Labs, 10Labs-Infrastructure, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium): Firewall rules for labs support host to communicate with contint1001.eqiad.wmnet (new gallium) - https://phabricator.wikimedia.org/T137323#2365455 (10hashar) [19:08:18] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [19:08:29] 06Operations, 06Labs, 10Labs-Infrastructure, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium): Firewall rules for labs support host to communicate with contint1001.eqiad.wmnet (new gallium) - https://phabricator.wikimedia.org/T137323#2364905 (10hashar) I have split... [19:08:30] mobrovac: ^ ? [19:08:34] gehel: if you've got to run, we're over time on our deployment window :(( [19:08:45] ori: yup, that's me, known [19:08:52] ok :) [19:09:04] thcipriani, gehel, almost done rebuilding tilerator, should be done within a few min, and will go much faster this time [19:09:09] no need for elaborate checks [19:09:49] yurik, thcipriani: I can still stay a few minutes, and I'll be not far away from the keyboard... 
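The resolution above (the replacement disk refused `-PDMakeJBOD` and `-PDOnline` until its foreign state was cleared) comes down to a short MegaCli sequence. A sketch with the enclosure/slot from the log; the foreign-config flags are the standard ones rather than commands quoted in the conversation:

```shell
# Replacement disk shows Unconfigured(good) but still carries metadata from
# a previous RAID ("Foreign State: Foreign"), so state changes fail.
megacli -CfgForeign -Scan -a0            # show foreign configs on adapter 0
megacli -CfgForeign -Clear -a0           # discard the old-array cruft
megacli -PDMakeJBOD -PhysDrv[32:5] -a0   # now succeeds: enclosure 32, slot 5
```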
[19:09:55] ok [19:09:57] kk [19:10:30] robh thanks, looking good [19:10:33] i'm copying old data back on [19:10:41] awesome! [19:10:42] gotta run home, will check up on this in a sec, will probably take a while to copy [19:11:35] thcipriani, gehel - tilerator on tin is ready for init [19:11:45] should i do it, or should i wait for the puppet run? [19:11:53] I botched my rebase of the puppet patch, as always... [19:12:15] fun :) [19:12:37] yurik: you can run: scap deploy --init now, you'll have to wait for the puppet run on the targets before you can run a deploy, but the puppet run on tin is more of a house-cleaning thing. [19:12:57] done [19:14:11] (03PS1) 10Gergő Tisza: Enable AuthManager in group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293357 (https://phabricator.wikimedia.org/T135504) [19:15:07] (03CR) 10Ori.livneh: "commit message says group1, change says group2?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293357 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [19:15:10] (03PS3) 10Gehel: Scap3 config for tilerator [puppet] - 10https://gerrit.wikimedia.org/r/291268 (https://phabricator.wikimedia.org/T129146) (owner: 10Thcipriani) [19:15:31] (03PS1) 10Bartosz Dziewoński: Update cross-wiki upload configuration for I2489004271078a [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293358 [19:15:59] 06Operations, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2365483 (10hashar) 18:50 mark poked me stating that the raid rebuild is complete and gallium reb... [19:16:04] yurik, thcipriani: could you give a quick look at https://gerrit.wikimedia.org/r/#/c/291268/ ? Getting tired, I'd prefer a second pair of eyes (or brain) [19:16:13] * thcipriani looks [19:16:59] (03CR) 10Gergő Tisza: "At changes from off-except-group-0 to on-except-group-2.
I can change to default: true, group0: false, group1: false if you think that's l" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293357 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [19:17:46] (03CR) 10Ori.livneh: "You mean group0: false, group1: true, right? (Yes, that is less confusing.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293357 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [19:17:50] gehel: this was removed in the kartotherian patch (and should stay gone) https://gerrit.wikimedia.org/r/#/c/291268/3/hieradata/common/role/deployment.yaml tilerator should actually be removed there as well. Sorry, I should have rebased before this window :( [19:18:24] thcipriani: I was just looking at it and thinking it did not look right... [19:18:27] * yurik looks [19:18:29] 06Operations, 13Patch-For-Review: decom furud - https://phabricator.wikimedia.org/T137221#2365484 (10Dzahn) i simply repeated the same gnt-instance remove command today and it was immediately done, no issues at all.. 
shrug [19:18:30] :D [19:18:48] looks ok [19:19:42] (03PS2) 10Gergő Tisza: Enable AuthManager in group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293357 (https://phabricator.wikimedia.org/T135504) [19:19:55] (03PS4) 10Gehel: Scap3 config for tilerator [puppet] - 10https://gerrit.wikimedia.org/r/291268 (https://phabricator.wikimedia.org/T129146) (owner: 10Thcipriani) [19:20:19] thcipriani: so this ^ should be better [19:20:53] gehel: also tilerator should be removed from hieradata/common/role/deployment.yaml [19:20:55] !log Bringing back Jenkins and Zuul on gallium T137265 [19:20:56] T137265: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265 [19:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:21:04] wish gallium disks some luck [19:21:14] (03PS3) 10Dzahn: decom furud [dns] - 10https://gerrit.wikimedia.org/r/293129 (https://phabricator.wikimedia.org/T137221) [19:21:18] thcipriani: damn, of course... [19:21:21] * gehel needs food! [19:21:26] err [19:21:28] gehel: like this: https://gerrit.wikimedia.org/r/#/c/291268/2/hieradata/common/role/deployment.yaml sorry :( [19:21:33] actually they are back .. [19:21:33] (03CR) 10Gergő Tisza: "False is "not disabled" (which is also somewhat confusing; we were trying to make it clear that AuthManager is the default in 1.27). So gr" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293357 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [19:21:47] (03PS5) 10Gehel: Scap3 config for tilerator [puppet] - 10https://gerrit.wikimedia.org/r/291268 (https://phabricator.wikimedia.org/T129146) (owner: 10Thcipriani) [19:22:26] (03CR) 10Thcipriani: [C: 031] Scap3 config for tilerator [puppet] - 10https://gerrit.wikimedia.org/r/291268 (https://phabricator.wikimedia.org/T129146) (owner: 10Thcipriani) [19:22:45] is zuul still dead? :( [19:22:54] thcipriani: if it isnt right this time, I'm giving up... 
empty stomach does not work well with me :( [19:23:10] gehel: looks good to me :) [19:23:19] (03CR) 10Dzahn: [C: 032] decom furud [dns] - 10https://gerrit.wikimedia.org/r/293129 (https://phabricator.wikimedia.org/T137221) (owner: 10Dzahn) [19:23:31] (03CR) 10Gehel: [C: 032 V: 032] "Jenkins still out, but multiple eyes checked the code..." [puppet] - 10https://gerrit.wikimedia.org/r/291268 (https://phabricator.wikimedia.org/T129146) (owner: 10Thcipriani) [19:23:39] thcipriani, yurik: ok, merging! [19:23:47] \o/ [19:23:50] 06Operations, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2365518 (10hashar) [19:23:52] 06Operations, 06Labs, 10Labs-Infrastructure, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium): Firewall rules for labs support host to communicate with contint1001.eqiad.wmnet (new gallium) - https://phabricator.wikimedia.org/T137323#2365519 (10hashar) [19:23:55] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Port Zuul package 2.1.0-95-g66c8e52 from Precise to Jessie - https://phabricator.wikimedia.org/T137279#2365520 (10hashar) [19:24:07] gehel: jenkins is back [19:24:12] just now [19:24:23] mutante: too late :) [19:24:37] yep:) just fyi and go eat! [19:24:43] 06Operations, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: / on gallium is read only, breaking jenkins - https://phabricator.wikimedia.org/T137265#2362975 (10hashar) 05Open>03Resolved gallium reboot apparently went with Zuul / Jenkins up s... [19:25:11] mutante: thanks anyway! Now let's see if I've broken the build with my non verified commits... 
[19:25:14] 06Operations, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Port Zuul package 2.1.0-95-g66c8e52 from Precise to Jessie - https://phabricator.wikimedia.org/T137279#2363409 (10hashar) [19:25:39] thcipriani: so I still need to run puppet on tin and on maps servers, correct? [19:25:51] gehel: correct [19:26:01] running... [19:26:36] 06Operations, 13Patch-For-Review: decom furud - https://phabricator.wikimedia.org/T137221#2365542 (10Dzahn) 05Open>03Resolved [19:26:45] thcipriani: you were right yesterday, lucky we booked a full hour :P [19:27:42] (03CR) 10Anomie: [C: 031] "PS1 and PS2 both look sane to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293357 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [19:29:20] !log gallium enabling puppet again now that zuul/jenkins are back [19:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:30:14] thcipriani, where should I add this - its the second half of the styling fix - https://gerrit.wikimedia.org/r/#/c/293360/ [19:30:16] puppet has run on tin && all maps servers... [19:30:24] * gehel is running to the kitchen! [19:30:32] !log change-prop deploying 08a1b1d [19:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:30:44] yurik, thcipriani: ping me if you need me, I'll be not too far away... [19:30:51] gehel: kk, thank you! [19:30:58] gehel, is it done? can i depl now? [19:30:59] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [19:31:23] :) [19:31:27] yurik: you should be clear to deploy: all servers with service restarts: scap deploy -v [19:31:37] awesome! [19:32:01] thcipriani, it went boom ;( [19:32:02] well. that didn't go so well. 
[19:32:13] 06Operations: decom furud - https://phabricator.wikimedia.org/T137221#2365564 (10Dzahn) [19:32:14] :) [19:32:24] i think its because testing might not be as well defined for it? [19:32:44] yurik: can you check the target machines to see if /srv/deployment/tilerator is owned by deploy-service? [19:32:56] checking... [19:33:32] thcipriani, no, root [19:33:42] only kartotherian is owned by deploy-service [19:34:40] gehel: help please, seems like there may have been a problem with the puppet run on the tilerator targets. [19:37:02] scap::target should change the ownership of /srv/deployment/tilerator to deploy-service if the folder is owned by root as part of https://gerrit.wikimedia.org/r/#/c/291268/ [19:38:26] the relevant bit of puppet is here https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/manifests/target.pp#L113 [19:41:11] thcipriani, should i perform rollback? [19:41:21] (y/n) [19:41:44] yurik: just ctrl-c. deploy failed pretty early in this instance and there's nothing to rollback to yet :) [19:41:56] ok [19:42:53] in normal operation, it'll restart your service, check that the port is accepting connections, run any custom checks you've defined and if any of that fails, it prompts you to rollback to the previously deployed code. [19:43:29] which is neat, but is not useful on the first deploy. [19:45:25] thcipriani, is there a way i can tell it to do a rollback even if it thinks everything is fine? [19:45:29] e.g. after canary depl [19:46:07] yurik: deploy will finish, then just run a new deploy and specify a revision with -r [rev-to-deploy] [19:46:49] thcipriani, but if i do the canary deploy only, and its waiting for me to say 'y' to continue, shouldn't i be able to do a rollback right there? [19:46:50] so, no, I guess :) [19:47:11] after all, that's the whole point about canary :) [19:47:31] canary must die! 
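The post-restart check thcipriani describes at [19:42:53] ("check that the port is accepting connections") can be sketched minimally. This is a hedged illustration only, not scap's actual code; `port_accepting` is a hypothetical helper covering just the TCP probe:

```python
import socket

def port_accepting(host, port, timeout=2.0):
    # Return True if host:port accepts TCP connections -- the kind of
    # post-restart probe a deploy tool runs before offering a rollback.
    # (Hypothetical sketch; scap's real checks are more involved.)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

In scap's flow, a `False` result here is one of the failures that triggers the rollback prompt.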
[19:47:37] oh wait, that's something else [19:47:40] ah, yes, we talked about that. I *think* that's what it does, but I'd have to doublecheck the code to be sure. If that's not what it does, you're right, that's what it should do. [19:49:27] yurik: well. shoot. I imagine puppet will catch up eventually on those boxes (or there's an error that I don't have permission to see). [19:49:56] * yurik loves security... its very good at preventing all sort of bad things... like doing work :-P [19:51:27] is there an opsen who can check the puppet log for maps-test200[1-4].codfw.wmnet and maps200[1-4].codfw.wmnet and see if there are errors? [19:51:35] thcipriani, don't worry, tilerator can totally wait on back burner, especially because i can run it in user space when i need it ;) [19:51:44] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [19:51:56] 06Operations, 06Mobile-Apps, 10Traffic, 06Wikipedia-iOS-App-Product-Backlog: Wikipedia app hits loads.php on bits.wikimedia.org - https://phabricator.wikimedia.org/T132969#2365653 (10Dbrant) [19:52:11] yurik: kk, well, I guess we'd see puppet failures via icinga, so I assume they'll all catch up and we can try this again tomorrow? [19:52:31] in the meantime, seems like we got CI back and I can merge your SWAT patches from this morning. [19:52:46] thcipriani, awesome! there is one more there :0 [19:52:47] :) [19:53:19] https://gerrit.wikimedia.org/r/#/c/293360/ [19:54:17] thcipriani, ^ - i added it to depl window [19:54:22] yurik: kk, thanks [19:55:10] (03CR) 10Krinkle: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293310 (owner: 10Krinkle) [19:55:33] thcipriani, thanks for your patience! :D [19:56:12] * gehel is back and reading ... [19:56:35] yurik: thanks for porting your projects to scap3! [19:56:51] ...now we just have to talk about graphoid :) [19:56:54] hehe, anything is better than dealing with salt ;) [19:57:00] oh, graphoid hasn't ported?
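The canary flow yurik and thcipriani are working out above (deploy a group, health-check it, wait for the operator's "y" to continue, otherwise roll back) can be sketched as a small loop. Every callable here is a hypothetical stand-in, not scap's real machinery:

```python
def canary_deploy(groups, deploy, healthy, prompt):
    # Deploy group by group; a failed health check means rollback,
    # and the operator can abort at the continue prompt instead of
    # typing 'y'. (Sketch only -- not scap's actual control flow.)
    for group in groups:
        deploy(group)
        if not healthy(group):
            return "rollback"
        if prompt(group) != "y":
            return "abort"
    return "done"
```

yurik's point is that "abort" at the canary prompt should behave like a rollback of the canary group, which is what the real tool was expected to do.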
[19:57:08] bummer, sorry, ready when you are :0 [19:57:39] np, I can make some puppet patches for it here shortly. Good to know that there's not a blocker :) [19:58:32] gehel: tl;dr, I think that puppet should have chowned /srv/deployment/tilerator on the maps boxes but it didn't seem to [19:58:47] thcipriani: sorry, I did not see your ping earlier [19:59:12] thcipriani, yurik : running puppet again on maps-test2001 to check for errors... [19:59:19] thank you! [19:59:26] didn't see anything come through icinga [20:00:36] thcipriani: no error, but /srv/deployment/tilerator/ still owned by root. Did I miss something in the rebase of the puppet patch? Could you check it with me again? [20:01:10] * thcipriani looks again [20:01:53] hmm, this should be it: https://gerrit.wikimedia.org/r/#/c/291268/5/modules/tilerator/manifests/init.pp [20:03:20] that change should trigger this: https://github.com/wikimedia/operations-puppet/blob/production/modules/service/manifests/node.pp#L127-L137 which should trigger https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/manifests/target.pp#L113-L133 [20:03:33] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160608T2000). [20:03:53] nothing to deploy for parsoid [20:03:53] jouncebot should use more commas [20:04:22] * subbu senses an ori patch landing [20:04:51] already deploying :) [20:04:59] thcipriani, btw, we can do graphoid as well :) [20:05:19] ori: jounce bot is a lazy typist ;) [20:05:30] yurik: maybe not today, still have a train to guide. [20:05:51] ok [20:06:07] let me know when you get the patches out [20:06:17] bd808: not consistently lazy: deploypage.py:172: ", ".join(self.deployers), , but jouncebot.py:152: deployers = (u" ".join(event.deployers)) [20:06:40] yurik: will do.
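The inconsistency ori quotes at [20:06:17] (deploypage.py joining names with ", " while jouncebot.py used " ") is the whole substance of his later patch, "Use commas to punctuate sequences of names" (Gerrit 293418). A minimal sketch of the fixed behavior; the function name is hypothetical:

```python
def format_deployers(deployers):
    # Join deployer names with ", " (as deploypage.py:172 already did)
    # instead of plain " " (as jouncebot.py:152 did before the fix),
    # so jouncebot's pings read as a proper list of names.
    return ", ".join(deployers)
```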
[20:07:19] thcipriani: I'm trying to read that code, but don't understand why it is not working yet. [20:07:19] 06Operations, 10ops-eqiad: Rack/Setup (18) new memcache Servers - https://phabricator.wikimedia.org/T137345#2365691 (10Cmjohnson) [20:07:20] ori: I'll merge and deploy if you put up a fix [20:07:27] 06Operations, 10ops-eqiad: Rack/Setup (18) new memcache Servers - https://phabricator.wikimedia.org/T137345#2365704 (10Cmjohnson) [20:07:28] working on it :) [20:08:30] gehel: yurik well, if it hasn't chowned that directory for deploy-service and there aren't any errors in the puppet log then I guess I'll have to dig a bit on it. I'm running a bit behind on the train, so I suppose we can revert for now and try it again later. [20:08:51] thcipriani: ok, I'll revert... [20:08:59] gehel, not kartotherian though ;) [20:09:03] gehel: thank you for all your help! [20:09:30] 06Operations, 10ops-eqiad: Rack/Setupnew memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#2365709 (10Cmjohnson) [20:09:30] thcipriani: no problem. I've learned stuff along the way! Thanks for your patience! [20:09:40] :D [20:09:45] yurik: nope, just reverting https://gerrit.wikimedia.org/r/#/c/291268/5 [20:09:55] speaking of train, anyone know about the uncommitted changes in wmf.4 in the Math extension? [20:09:57] 06Operations, 10ops-eqiad: Rack/Setupnew memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#2365691 (10Cmjohnson) [20:13:04] (03PS1) 10Gehel: Revert "Scap3 config for tilerator" [puppet] - 10https://gerrit.wikimedia.org/r/293375 [20:15:37] yurik: syncing your kartographer updates now [20:15:43] awesome! 
[20:16:19] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [20:18:00] !log thcipriani@tin Synchronized php-1.28.0-wmf.4/extensions/Kartographer: late SWAT: [[gerrit:293251|Fix color extraction]] (duration: 00m 36s) [20:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:18:04] thcipriani: are you updating group1 today? [20:18:05] ^ yurik check please [20:18:26] thcipriani, did you do just one or both? [20:18:41] yurik: just wmf.4 right now, I'll get the other in a second. [20:18:52] thcipriani, thx, checking [20:19:01] tgr: that is still the plan, but I'm running late, still blockers to deal with. [20:19:23] thcipriani, awesome, works! [20:20:14] (03PS1) 10Gehel: Fixing lint issues [puppet] - 10https://gerrit.wikimedia.org/r/293377 [20:21:11] 06Operations, 10Mail: not being able to send emails via Special:EmailUser - https://phabricator.wikimedia.org/T137337#2365727 (10Luke081515) It seems like this is a wikimedia issue, and related to fawiki, since personally my wikimail today (from dewiki) got delivered. [20:21:37] !log thcipriani@tin Synchronized php-1.28.0-wmf.5/extensions/Kartographer/styles/kartographer.less: [[gerrit:293360|Fixed autostyling]] (duration: 00m 26s) [20:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:21:45] ^ yurik there is the less file [20:21:55] thx! [20:22:18] thcipriani, let's try to finish the graphoid & tilerator tomorrow :) [20:22:30] yurik: kk, sounds like a plan, thanks :) [20:23:38] Krinkle: is this patch good to go https://gerrit.wikimedia.org/r/#/c/293363/ ? [20:24:40] tgr: is there something to backport for wmf.5 for https://phabricator.wikimedia.org/T135656 ?
[20:25:15] thcipriani: yeah, the patch at the end [20:25:25] not sure if it fixes it [20:25:57] there are two errors, this fixes one, I'm hoping the other was a consequence somehow [20:26:29] thcipriani: I guess so, but it's not been tested yet due to Jenkins issues [20:26:29] kk https://gerrit.wikimedia.org/r/#/c/293383/ [20:27:05] thcipriani: the 'relevant' jobs did run and pass, though. [20:27:15] The qunit failures are unrelated and false failures due to Jenkins infra [20:27:50] (03CR) 10Mobrovac: [C: 031] Add ssh:userkey for eventlogging scap::targets [puppet] - 10https://gerrit.wikimedia.org/r/293217 (https://phabricator.wikimedia.org/T137192) (owner: 1020after4) [20:27:59] Krinkle: ack. I'll +2 and let it run through its tests. [20:30:59] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:33:27] ah, blerg. is nodepool not working now? [20:33:51] bd808: making it nice and overly complicated [20:33:58] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 1.595 second response time [20:36:59] (03PS1) 10Ori.livneh: Use commas to punctuate sequences of names [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/293418 [20:37:05] bd808: ^ [20:37:35] heh [20:37:46] (03PS1) 10Ladsgroup: ores.wikimedia.org instead of ores.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293419 [20:39:33] ^ anyone who can merge this, super trivial. :) [20:41:17] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [20:42:16] akosiaris: hey, I want to fix the race condition in ores but I need someone to merge https://gerrit.wikimedia.org/r/292516 once I got the fix deployed in staging, labs, and prod. Do you have some time? [20:43:39] tgr: could you double-check me on this backport?
https://gerrit.wikimedia.org/r/#/c/293383/ [20:44:33] (03CR) 10jenkins-bot: [V: 04-1] Revert "Scap3 config for tilerator" [puppet] - 10https://gerrit.wikimedia.org/r/293375 (owner: 10Gehel) [20:45:47] 06Operations, 10netops, 10Continuous-Integration-Infrastructure (phase-out-gallium): install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#2365862 (10hashar) 05Open>03Resolved Due to gallium losing a disk ( T137265 ) @Joe allocated a new server from the pool.... [20:47:38] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:49:37] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 7.889 second response time [20:49:49] (03PS4) 10Dzahn: DNS: Add prod DNS for mw2215-mw2250 and removed the old mw entries mw2001-mw2016/mw2018-mw2060 Bug:T135466 [dns] - 10https://gerrit.wikimedia.org/r/292307 (https://phabricator.wikimedia.org/T135466) (owner: 10Papaul) [20:49:57] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [20:50:06] !log thcipriani@tin Synchronized php-1.28.0-wmf.5/includes/libs/objectcache/WANObjectCache.php: [[gerrit:293363|Avoid getWithSetCallback() warnings on unversioned key migration]] (duration: 00m 24s) [20:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:54:08] (03CR) 10BryanDavis: "recheck" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/293418 (owner: 10Ori.livneh) [20:55:25] (03CR) 10Gehel: [C: 032] Fixing lint issues [puppet] - 10https://gerrit.wikimedia.org/r/293377 (owner: 10Gehel) [20:56:57] !log thcipriani@tin Synchronized php-1.28.0-wmf.5/extensions/Renameuser/RenameuserSQL.php: [[gerrit:293383|Use master DB when touching the user to signal rename end]] (duration: 00m 22s) [20:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:57:01] 06Operations, 10ops-codfw, 10DBA:
es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2365897 (10Papaul) @jcrespo here is the reply back from the Dell support team so I will need another down time on those systems tomorrow or Friday. Hello Papaul, We have just found a fix fo... [20:57:08] ^ tgr backport is sync'd, FYI [20:57:16] (03CR) 10Dzahn: [C: 032] DNS: Add prod DNS for mw2215-mw2250 and removed the old mw entries mw2001-mw2016/mw2018-mw2060 Bug:T135466 [dns] - 10https://gerrit.wikimedia.org/r/292307 (https://phabricator.wikimedia.org/T135466) (owner: 10Papaul) [20:58:00] (03PS2) 10Gehel: Revert "Scap3 config for tilerator" [puppet] - 10https://gerrit.wikimedia.org/r/293375 [20:58:01] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:58:11] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp main page via mobile-sections-lead) is CRITICAL: Could not fetch url http://mobileapps.svc.eqiad.wmnet:8888/en.wikipedia.org/v1/page/mobile-sections-lead/Main_Page: Timeout on connection while downloading http://mobileapps.svc.eqiad.wmnet:8888/en.wikipedia.org/v1/page/mobile-sections- [20:58:11] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [20:58:17] what? [20:58:25] (03CR) 10Dzahn: "The pybal config in /srv on palladium that i looked at is not actually used anymore. 
When checking with conftool we could confirm they ar" [dns] - 10https://gerrit.wikimedia.org/r/292307 (https://phabricator.wikimedia.org/T135466) (owner: 10Papaul) [20:58:35] changeprop is known, looking into the others [20:59:13] (03CR) 10BryanDavis: [C: 032] Use commas to punctuate sequences of names [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/293418 (owner: 10Ori.livneh) [20:59:36] mobileapps is a false negative [21:00:01] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [21:00:01] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [21:01:43] (03Merged) 10jenkins-bot: Use commas to punctuate sequences of names [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/293418 (owner: 10Ori.livneh) [21:04:00] !log thcipriani@tin Synchronized php-1.28.0-wmf.5/includes/specials/SpecialSearch.php: [[gerrit:293422|Add a visual clear to Special:Search input box and profile-tabs]] (duration: 00m 23s) [21:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:04:12] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [21:06:22] alright, *now* going to try to roll wmf.5 forward. [21:06:53] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [21:08:20] (03PS1) 10Thcipriani: group1 wikis to 1.28.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293427 [21:08:42] thcipriani: rename seems fixed [21:09:05] (03CR) 10Alex Monk: "in labs...?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293419 (owner: 10Ladsgroup) [21:09:08] tgr: awesome! thank you for checking. [21:09:42] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:10:01] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:10:09] (03CR) 10Thcipriani: [C: 032] group1 wikis to 1.28.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293427 (owner: 10Thcipriani) [21:10:46] (03Merged) 10jenkins-bot: group1 wikis to 1.28.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293427 (owner: 10Thcipriani) [21:11:19] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.28.0-wmf.5 [21:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:11:28] heh, so we just removed old appservers from DNS while you were deploying but things are fine [21:11:46] deploy :) [21:12:01] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [21:12:11] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Could not fetch url http://mobileapps.svc.eqiad.wmnet:8888/en.wikipedia.org/v1/page/media/Cat: Timeout on connection while downloading http://mobileapps.svc.eqiad.wmnet:8888/en.wikipedia.org/v1/page/media/Cat [21:12:30] mobileapps? [21:12:36] (03CR) 10Ladsgroup: "That's the ores extension setting for beta wikis. For now, we are checking to see if the prod environment can work with the ores extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293419 (owner: 10Ladsgroup) [21:12:45] oh, yeah, scap reported no errors so I assume it was able to contact all servers it knows about.
[21:13:14] ok, yes, good [21:13:27] they were already shutdown anyways [21:13:32] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [21:13:37] worst case would have been a remnant in dsh group file [21:13:52] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [21:14:00] and that seems good too. alright [21:14:12] welcome back jouncebot [21:14:20] jouncebot: next [21:14:20] In 0 hour(s) and 45 minute(s): AuthManager (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160608T2200) [21:14:29] (03CR) 10Gehel: [C: 032] Revert "Scap3 config for tilerator" [puppet] - 10https://gerrit.wikimedia.org/r/293375 (owner: 10Gehel) [21:15:00] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium): Migrate CI services from gallium to contint1001 - https://phabricator.wikimedia.org/T137358#2365971 (10hashar) [21:15:16] 06Operations, 06Labs, 10Labs-Infrastructure, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium): Firewall rules for labs support host to communicate with contint1001.eqiad.wmnet (new gallium) - https://phabricator.wikimedia.org/T137323#2364905 (10hashar) [21:15:18] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium): Migrate CI services from gallium to contint1001 - https://phabricator.wikimedia.org/T137358#2365985 (10hashar) [21:15:26] !log change-prop reverting back to 96337cd540a2 [21:16:38] 06Operations, 10Traffic, 10Continuous-Integration-Infrastructure (phase-out-gallium): Move gallium to an internal host?
- https://phabricator.wikimedia.org/T133150#2365991 (10hashar) [21:17:01] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:17:02] yurik, thcipriani: scap3 for tilerator finally reverted (jenkins is still dead slow, and there was a few lint errors to correct) [21:17:09] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium): Migrate CI services from gallium to contint1001 - https://phabricator.wikimedia.org/T137358#2365994 (10hashar) [21:17:11] 06Operations, 10Traffic, 10Continuous-Integration-Infrastructure (phase-out-gallium): Move gallium to an internal host? - https://phabricator.wikimedia.org/T133150#2223557 (10hashar) [21:17:32] 06Operations, 10Traffic, 10Continuous-Integration-Infrastructure (phase-out-gallium): Move gallium to an internal host? - https://phabricator.wikimedia.org/T133150#2223557 (10hashar) With gallium that lost a disk today, we had contint1001.eqiad.wmnet allocated (Jessie and private IP). Switching services to... [21:17:43] (03PS8) 10Madhuvishy: uwsgi: Allow specifying plugins as a uwsgi command line option [puppet] - 10https://gerrit.wikimedia.org/r/292030 [21:18:19] gehel: thank you! sorry that took so long :( [21:18:23] 06Operations, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Port Zuul package 2.1.0-95-g66c8e52 from Precise to Jessie - https://phabricator.wikimedia.org/T137279#2366001 (10hashar) [21:18:25] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium): Migrate CI services from gallium to contint1001 - https://phabricator.wikimedia.org/T137358#2365971 (10hashar) [21:18:55] thcipriani: nothing you could have done about it... 
[21:19:05] (03PS5) 10Hashar: contint: cleanup gallium / use contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/293283 (https://phabricator.wikimedia.org/T137358) [21:19:38] (03PS4) 10Hashar: cache_misc: change doc/integration.wm.o backend [puppet] - 10https://gerrit.wikimedia.org/r/293284 (https://phabricator.wikimedia.org/T137358) [21:21:32] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:22:41] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [21:24:45] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Migrate CI services from gallium to contint1001 - https://phabricator.wikimedia.org/T137358#2366095 (10hashar) [21:25:42] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [21:26:38] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Migrate CI services from gallium to contint1001 - https://phabricator.wikimedia.org/T137358#2365971 (10hashar) https://gerrit.wikimedia.org/r/#/c/293283/ against puppet.git is a beast it basically change all occurr... [21:26:56] (03CR) 10Hashar: "Now associated with task T137358" [puppet] - 10https://gerrit.wikimedia.org/r/293283 (https://phabricator.wikimedia.org/T137358) (owner: 10Hashar) [21:27:42] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [21:28:52] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). 
[21:31:23] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [21:35:47] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:36:27] (03PS1) 10Ori.livneh: xhgui: provision xhprof UI [puppet] - 10https://gerrit.wikimedia.org/r/293428 [21:36:41] (03PS2) 10Ori.livneh: xhgui: provision xhprof UI [puppet] - 10https://gerrit.wikimedia.org/r/293428 [21:36:56] (03CR) 10Ori.livneh: [C: 032 V: 032] xhgui: provision xhprof UI [puppet] - 10https://gerrit.wikimedia.org/r/293428 (owner: 10Ori.livneh) [21:37:27] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [21:40:39] (03PS1) 10Ladsgroup: ores: Add redis settings to worker nodes in labs [puppet] - 10https://gerrit.wikimedia.org/r/293429 [21:43:15] (03PS1) 10Ori.livneh: apache config for xhprof UI [puppet] - 10https://gerrit.wikimedia.org/r/293430 [21:43:29] (03CR) 10Ori.livneh: [C: 032 V: 032] apache config for xhprof UI [puppet] - 10https://gerrit.wikimedia.org/r/293430 (owner: 10Ori.livneh) [21:45:56] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.006 second response time [21:47:47] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.023 second response time [21:53:09] (03CR) 10Ladsgroup: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293419 (owner: 10Ladsgroup) [21:53:32] (03CR) 10Ladsgroup: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/292516 (owner: 10Ladsgroup) [21:55:48] (03PS2) 10Krinkle: Bump wgResourceLoaderStorageVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293310 [21:56:02] (03CR) 10Krinkle: [C: 032] Bump wgResourceLoaderStorageVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293310 (owner: 10Krinkle) [21:56:43] (03Merged) 10jenkins-bot: Bump wgResourceLoaderStorageVersion [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/293310 (owner: 10Krinkle) [21:59:28] tgr: Dear anthropoid, the time has come. Please deploy AuthManager (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160608T2200). [22:00:56] !log krinkle@tin Synchronized wmf-config/CommonSettings.php: Bump wgResourceLoaderStorageVersion (T134368) (duration: 00m 28s) [22:00:57] T134368: Image urls in CSS remain cached with old $wgResourceBasePath - https://phabricator.wikimedia.org/T134368 [22:01:13] * Krinkle verified and signs out of deployment server [22:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:03:23] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [22:04:25] !log starting kafka broker on kafka1012 after swapping disk and copying data directory [22:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:06:20] RECOVERY - Kafka Broker Server on kafka1012 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties [22:19:58] !log tgr@tin Synchronized php-1.28.0-wmf.5/extensions/OpenStackManager/: backport [[gerrit:293130]] for AuthManager deploy T135504 (duration: 00m 28s) [22:19:59] T135504: Enable AuthManager in WMF production - https://phabricator.wikimedia.org/T135504 [22:20:24] (03PS3) 10Gergő Tisza: Enable AuthManager in group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293357 (https://phabricator.wikimedia.org/T135504) [22:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:24:30] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:24:45] (03CR) 10Gergő Tisza: [C: 032] Enable AuthManager in group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293357 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [22:25:19] (03Merged) 10jenkins-bot: Enable AuthManager in group1 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/293357 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [22:26:19] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.015 second response time [22:26:55] !log tgr@tin Synchronized wmf-config/InitialiseSettings.php: enable AuthManager on group1 T135504 (duration: 00m 23s) [22:26:56] T135504: Enable AuthManager in WMF production - https://phabricator.wikimedia.org/T135504 [22:27:01] anomie: ^ [22:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:32:30] tgr: It doesn't look to be enabled? [22:32:43] anomie: yeah, just noticed [22:36:08] PROBLEM - puppet last run on mw2205 is CRITICAL: CRITICAL: puppet fail [22:36:30] anomie: does 'group1' work? there are no more instances of it in wmf-config which is a bit suspicious [22:36:41] Yes, group1 works. [22:36:56] Errr, works in some ways [22:36:59] It's a .dblist file [22:37:06] I dunno if we populate it in CommonSettings tho [22:37:55] Do we have to do default: false and wikipedia: true to get the effect? [22:38:40] the contents of the file is all.dblist - group0.dblist - wikipedia.dblist + group1-wikipedia.dblist so let's go with that [22:40:34] Looking through old commits, it looks like that's the answer, yeah. 
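anomie's description at [22:38:40] of how group1 is composed ("all.dblist - group0.dblist - wikipedia.dblist + group1-wikipedia.dblist") is plain set arithmetic. A sketch with hypothetical toy lists; in production the operands are the actual *.dblist files:

```python
def compose_group1(all_wikis, group0, wikipedias, group1_wikipedias):
    # group1 = all.dblist - group0.dblist - wikipedia.dblist
    #          + group1-wikipedia.dblist, per the discussion above.
    # Inputs are lists of database names; result is a set.
    return (set(all_wikis) - set(group0) - set(wikipedias)) | set(group1_wikipedias)
```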
[22:41:08] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.002 second response time [22:41:19] (03PS1) 10Gergő Tisza: Fix AuthManager feature switch configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293434 (https://phabricator.wikimedia.org/T135504) [22:42:26] (03CR) 10Gergő Tisza: [C: 032] Fix AuthManager feature switch configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293434 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [22:42:59] (03Merged) 10jenkins-bot: Fix AuthManager feature switch configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293434 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [22:43:07] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.032 second response time [22:44:06] !log tgr@tin Synchronized wmf-config/InitialiseSettings.php: enable AuthManager on group1 for reals T135504 (duration: 00m 25s) [22:44:07] T135504: Enable AuthManager in WMF production - https://phabricator.wikimedia.org/T135504 [22:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:44:33] seems to have worked this time [22:44:40] (03CR) 10Anomie: Fix AuthManager feature switch configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293434 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [22:47:26] /13/12 [22:47:37] anomie: yeah, it's not enabled on group1 pedias [22:48:22] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw old mw app server decomission - https://phabricator.wikimedia.org/T135468#2366346 (10RobH) ok, ge-3/0 is done, need to do ge-4/0 interfaces next. [22:50:59] hey, I have access to grafana-admin but I can't login to graphite.wikimedia.org Do I need another access request? 
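The "default: false and wikipedia: true" pattern from the fix above resolves per wiki roughly like this. This is a hedged sketch of dblist-tag-based settings resolution; `resolve_flag` is a hypothetical helper, and MediaWiki's real logic (SiteConfiguration) is considerably richer:

```python
def resolve_flag(wiki, setting, dblists):
    # Start from the 'default' value, then let dblist-tag entries
    # (e.g. 'wikipedia': True) override it for wikis in that list.
    # A tag that exists in no dblist -- like the nonexistent 'group1'
    # from the first attempt -- simply never matches anything.
    value = setting.get("default")
    for tag, tag_value in setting.items():
        if tag != "default" and wiki in dblists.get(tag, ()):
            value = tag_value
    return value
```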
or I'm missing something [22:51:55] !log Re-started dumpwikidatattl on snapshot1003 [22:52:28] 06Operations, 06Labs, 10Labs-Infrastructure, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium): Firewall rules for labs support host to communicate with contint1001.eqiad.wmnet (new gallium) - https://phabricator.wikimedia.org/T137323#2364905 (10Dzahn) for the flows... [22:52:43] mutante: hey, do you know about this ^^^ [22:52:46] Amir1: should be the same LDAP password [22:52:58] I checked and didn't work [22:53:29] I double checked now :D [22:53:33] didn't work either [22:53:57] sorry, i dont know it yet, just that it is a new group [22:54:14] did you just get new access or did you have it before [22:55:54] I just got access [22:55:59] to grafana-admin [22:56:39] is it possible to grant access to graphite for grafana-admin LDAP group? [22:57:06] 06Operations, 06Labs, 10Labs-Infrastructure, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium): Firewall rules for labs support host to communicate with contint1001.eqiad.wmnet (new gallium) - https://phabricator.wikimedia.org/T137323#2366395 (10Dzahn) Same for labno... [22:57:28] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 50.00% above the threshold [1.0] [22:57:48] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1014 is OK: OK: Less than 50.00% above the threshold [1.0] [22:57:50] i dont know [22:57:55] let me see which groups you have [22:57:58] checking [22:58:58] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:58:59] Amir1: so yea, it looks like you are in the new group grafana-admin but you are not in the old groups wmf or nda [22:59:06] Amir1: which probably explains this [22:59:38] is it possible to grant rights to the grafana-admin group?
[23:00:04] RoanKattouw, ostriches, Krenair, MaxSem, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160608T2300). [23:00:06] possible yes, but that would need a new request ticket [23:00:07] graphite data is not sensitive (and even if it was, I already signed the NDA) [23:00:20] or maybe you are asking for membership in the nda group [23:00:25] mhm I guess I'll deploy [23:00:27] what else would you want to look at [23:00:34] Hello. [23:00:42] i'd have to check which tools allow which groups exactly [23:00:57] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 7.614 second response time [23:01:00] mutante: I need graphite for now [23:02:33] 06Operations, 10LDAP-Access-Requests, 07LDAP: Grant graphite.wikimedia.org rights to grafana-admin LDAP group - https://phabricator.wikimedia.org/T137373#2366461 (10Ladsgroup) [23:03:31] I was so close to making the grafana dashboard for ores in prod [23:04:04] RECOVERY - puppet last run on mw2205 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [23:04:20] mutante: if you have some time, and if it's okay for you, can you check graphite.wikimedia.org and tell me what the ores logs look like [23:04:31] (03CR) 10Gergő Tisza: "Turns out there is no group1 group. Fix was I1fce011c162f38fb0befea0f24fe498f63a5b3f7." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/293357 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [23:04:55] it should be something like ores.scb1001.scores_requests.etc.etc.etc [23:05:27] it will save me several hours [23:05:36] !log maxsem@tin Synchronized php-1.28.0-wmf.5/extensions/LiquidThreads/: https://gerrit.wikimedia.org/r/#/c/293247/ (duration: 00m 26s) [23:05:38] Amir1: so graphite is "nda/ops/wmf" [23:05:44] Amir1: ok [23:06:12] let me make the patch for it :D [23:06:46] i am not sure we will want to use that group name [23:06:54] for another tool [23:06:54] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1013 is OK: OK: Less than 50.00% above the threshold [1.0] [23:07:05] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:07:05] but it can of course be figured out [23:07:06] hmm, you're right [23:07:09] let me look for the ores log [23:07:25] !log maxsem@tin Synchronized php-1.28.0-wmf.4/extensions/LiquidThreads/: https://gerrit.wikimedia.org/r/#/c/293247/ (duration: 00m 26s) [23:07:50] so i am searching for "ores" [23:08:10] and i see servers.oresrdb1001.tcp.... 
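(Context for the access discussion above: per-tool access like graphite's "nda/ops/wmf" is typically enforced with Apache `mod_authnz_ldap` group checks. A minimal sketch only; the server URL, base DNs, and group DNs below are illustrative placeholders, not the actual production configuration:)

```apache
<Location "/">
    AuthType Basic
    AuthName "LDAP login"
    AuthBasicProvider ldap
    # Hypothetical directory server and search base -- placeholders only
    AuthLDAPURL "ldaps://ldap.example.org/ou=people,dc=example,dc=org?cn"
    # Membership in any one listed group grants access
    # (multiple Require lines default to RequireAny in Apache 2.4)
    Require ldap-group cn=nda,ou=groups,dc=example,dc=org
    Require ldap-group cn=ops,ou=groups,dc=example,dc=org
    Require ldap-group cn=wmf,ou=groups,dc=example,dc=org
</Location>
```

(Under such a setup, granting the grafana-admin group access to graphite would just mean adding one more `Require ldap-group` line, which is why the request below is a small change once approved.)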
[23:08:24] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK: OK: Less than 50.00% above the threshold [1.0] [23:08:36] not yet ores.scb1001 [23:08:46] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.027 second response time [23:08:47] Amir1, check out https://github.com/wiki-ai/ores-wikimedia-config/blob/master/config/00-main.yaml#L25 [23:08:49] okay [23:08:49] That [23:09:01] We're probably sending statsd logs to the wrong endpoint [23:09:21] okay [23:09:25] halfak: let me fix it [23:09:43] thanks mutante [23:10:01] alright, np [23:10:01] I'll try to get access as soon as possible [23:10:19] please do make a new request ticket [23:10:25] they get picked up for sure in the next meeting [23:10:32] sure [23:10:48] (03PS1) 10Yurik: Enable wgKartographerUseMarkerStyle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293438 [23:10:52] https://phabricator.wikimedia.org/T137373 [23:11:02] (03PS1) 10Gergő Tisza: Clean up AuthManager configuration (no-op) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293440 (https://phabricator.wikimedia.org/T135504) [23:11:07] MaxSem, ^ pls deploy [23:11:17] i meant ^^^ [23:11:25] 06Operations, 10LDAP-Access-Requests, 07LDAP: Grant graphite.wikimedia.org rights to grafana-admin LDAP group - https://phabricator.wikimedia.org/T137373#2366492 (10Ladsgroup) pinging @jcrespo who made the LDAP group. 
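(A note on the "wrong statsd endpoint" suspicion above: statsd metrics are fire-and-forget UDP datagrams in the form `<metric>:<value>|<type>`, so a client pointed at the wrong host silently drops every metric, which matches the symptom of the `ores.scb1001.*` paths never appearing in graphite. A minimal sketch; the host, port, and exact metric name below are assumptions for illustration, not the real ORES configuration:)

```python
import socket

def statsd_packet(metric: str, value: int, metric_type: str = "c") -> bytes:
    """Encode one metric in the plain statsd line protocol: <metric>:<value>|<type>."""
    return f"{metric}:{value}|{metric_type}".encode("ascii")

def send_metric(host: str, port: int, metric: str, value: int = 1) -> bytes:
    """Fire-and-forget a counter over UDP, the way statsd clients do.

    If `host` is the wrong endpoint, nothing fails loudly -- the datagram
    just never reaches the statsd/graphite pipeline.
    """
    packet = statsd_packet(metric, value)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))
    return packet

# Hypothetical metric name modeled on the path mentioned above;
# "statsd.example.org" stands in for whatever endpoint the config should point at:
# send_metric("statsd.example.org", 8125, "ores.scb1001.scores_request", 1)
```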
[23:11:25] TIL that there is a phabricator tag just for "LDAP-Access-Requests" nowadays [23:11:41] ok, meanwhile please add it to deployments [23:11:46] ok [23:11:55] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1018 is OK: OK: Less than 50.00% above the threshold [1.0] [23:12:17] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 07LDAP: Grant graphite.wikimedia.org rights to grafana-admin LDAP group - https://phabricator.wikimedia.org/T137373#2366496 (10Dzahn) [23:12:22] :D [23:12:51] (03CR) 10MaxSem: [C: 032] Enable wgKartographerUseMarkerStyle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293438 (owner: 10Yurik) [23:13:30] (03Merged) 10jenkins-bot: Enable wgKartographerUseMarkerStyle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293438 (owner: 10Yurik) [23:13:56] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 07LDAP: Grant graphite.wikimedia.org rights to grafana-admin LDAP group - https://phabricator.wikimedia.org/T137373#2366461 (10Dzahn) It seems the questions are.. Should we create yet another group called graphite-admin? Should we re-use grafan... [23:15:10] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/293438 (duration: 00m 25s) [23:15:19] yurik, ^ [23:15:33] MaxSem, thx, I updated the deployment page [23:16:01] yep, works, thx! [23:16:16] MaxSem, now we have to regen all the markers :D [23:16:50] I think there was a null-editing bot in pywikibot [23:17:09] touch.py [23:17:32] !log maxsem@tin Synchronized php-1.28.0-wmf.5/extensions/WikimediaEvents/: https://gerrit.wikimedia.org/r/#/c/293439/ (duration: 00m 23s) [23:18:47] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 07LDAP: Grant graphite.wikimedia.org rights to grafana-admin LDAP group - https://phabricator.wikimedia.org/T137373#2366500 (10Ladsgroup) Either of those is fine for me. 
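(On the null-edit idea above: pywikibot's touch.py re-saves each page with unchanged content, which forces MediaWiki to re-render it, the effect wanted here to regenerate the markers. In action-API terms a null edit can be expressed as an `action=edit` that appends an empty string. The helper below is a hypothetical illustration of the request parameters only; token acquisition, the HTTP session, and error handling are omitted:)

```python
def null_edit_params(title: str, csrf_token: str) -> dict:
    """Build MediaWiki action-API parameters for a null edit.

    Appending an empty string saves the page with unchanged content,
    which still triggers a re-parse -- the same effect as touch.py.
    """
    return {
        "action": "edit",
        "title": title,
        "appendtext": "",   # no content change => null edit
        "token": csrf_token,
        "format": "json",
    }

# Hypothetical usage with a requests-style client (not runnable as-is):
# session.post("https://example.org/w/api.php", data=null_edit_params("Some page", token))
```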
[23:21:35] (03PS1) 10Dzahn: contint: add firewall rule for nodepool to Jenkins API [puppet] - 10https://gerrit.wikimedia.org/r/293441 (https://phabricator.wikimedia.org/T137323) [23:25:15] (03CR) 10Dzahn: [C: 032] contint: add firewall rule for nodepool to Jenkins API [puppet] - 10https://gerrit.wikimedia.org/r/293441 (https://phabricator.wikimedia.org/T137323) (owner: 10Dzahn) [23:25:47] (03PS3) 10Madhuvishy: [WIP] jupyterhub: Add module to set up Jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/288086 [23:25:49] (03PS1) 10Ladsgroup: ores: Add graphite settings [puppet] - 10https://gerrit.wikimedia.org/r/293442 (https://phabricator.wikimedia.org/T137367) [23:26:56] (03CR) 10jenkins-bot: [V: 04-1] [WIP] jupyterhub: Add module to set up Jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/288086 (owner: 10Madhuvishy) [23:28:34] (03PS4) 10Madhuvishy: [WIP] jupyterhub: Add module to set up Jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/288086 [23:29:26] 06Operations, 06Labs, 10Labs-Infrastructure, 06Release-Engineering-Team, and 2 others: Firewall rules for labs support host to communicate with contint1001.eqiad.wmnet (new gallium) - https://phabricator.wikimedia.org/T137323#2366543 (10Dzahn) |--|--|--|--|--|--|--|-- | TCP | scandium | 10.64.4.12 | contin... [23:29:49] (03CR) 10jenkins-bot: [V: 04-1] [WIP] jupyterhub: Add module to set up Jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/288086 (owner: 10Madhuvishy) [23:31:45] (03PS5) 10Madhuvishy: [WIP] jupyterhub: Add module to set up Jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/288086 [23:32:04] 06Operations, 06Performance-Team, 06Services, 07Availability: Create restbase BagOStuff subclass (session storage) - https://phabricator.wikimedia.org/T137272#2363223 (10GWicke) @smalyshev & I just did a bit of brainstorming. We mostly went through the requirements for session storage, and then looked at h... 
[23:36:12] (03PS1) 10Hoo man: Retry Wikidata dump creation up to three times [puppet] - 10https://gerrit.wikimedia.org/r/293445 [23:40:57] 06Operations, 06Performance-Team, 06Services, 07Availability: Create restbase BagOStuff subclass (session storage) - https://phabricator.wikimedia.org/T137272#2363223 (10ori) >>! In T137272#2366546, @GWicke wrote: > * Overall size: Check Redis memory size Total: 1405 Mb in use, 9000 Mb max (500 Mb x 18 se... [23:42:22] (03PS2) 10Hoo man: Retry Wikidata dump creation up to three times [puppet] - 10https://gerrit.wikimedia.org/r/293445 [23:45:20] (03PS1) 10Papaul: DNS: Add mgmt entries for mw2239-mw2250 and removed old servers Bug: T135466 [dns] - 10https://gerrit.wikimedia.org/r/293446 (https://phabricator.wikimedia.org/T135466) [23:51:08] 06Operations, 06Performance-Team, 06Services, 07Availability: Create restbase BagOStuff subclass (session storage) - https://phabricator.wikimedia.org/T137272#2366589 (10ori) >>! In T137272#2366546, @GWicke wrote: > * Request volume: each authenticated page request - volume? Crude estimates for determinin... [23:51:31] 06Operations, 10ops-codfw: codfw old mw app server decomission - https://phabricator.wikimedia.org/T135468#2366590 (10RobH) [23:52:00] (03PS1) 10MaxSem: Switch OSM replication to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/293448 [23:52:14] (03PS1) 10Dzahn: contint: limit access to zuul-merger git daemon [puppet] - 10https://gerrit.wikimedia.org/r/293449 (https://phabricator.wikimedia.org/T137323) [23:53:05] MaxSem: i can take that if there is no reason to wait [23:54:40] (03PS2) 10Dzahn: contint: limit access to zuul-merger git daemon [puppet] - 10https://gerrit.wikimedia.org/r/293449 (https://phabricator.wikimedia.org/T137323) [23:55:21] 06Operations, 06Performance-Team, 06Services, 07Availability: Create restbase BagOStuff subclass (session storage) - https://phabricator.wikimedia.org/T137272#2366597 (10GWicke) Thank you, @ori! This information is very helpful. 
[23:56:09] mutante, thanks! [23:56:30] (03CR) 10Dzahn: [C: 031] Switch OSM replication to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/293448 (owner: 10MaxSem) [23:56:39] (03CR) 10Dzahn: [C: 032] Switch OSM replication to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/293448 (owner: 10MaxSem) [23:57:20] (03PS2) 10Dzahn: DNS: Add mgmt entries for mw2239-mw2250 and removed old servers Bug: T135466 [dns] - 10https://gerrit.wikimedia.org/r/293446 (https://phabricator.wikimedia.org/T135466) (owner: 10Papaul) [23:58:28] 06Operations, 10ops-codfw: rack/setup/deploy mw22[1-5][0-9] switch configuration - https://phabricator.wikimedia.org/T136670#2366602 (10Papaul) @Rob adding 8 more servers in rack B4 mw2243 row B rack B4 ge-4/0/4 mw2244 row B rack B4 ge-4/0/5 mw2245 row B rack B4 ge-4/0/6 mw2246 row B rack B4 ge-4/0/7 mw2247 r... [23:59:46] (03CR) 10Dzahn: [C: 031] "just let us know when by adding your own +1 or so" [puppet] - 10https://gerrit.wikimedia.org/r/293284 (https://phabricator.wikimedia.org/T137358) (owner: 10Hashar)