[00:00:05] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150924T0000). [00:00:08] (03CR) 10Yuvipanda: "Is this still needed?" [puppet] - 10https://gerrit.wikimedia.org/r/208397 (owner: 10Hoo man) [00:00:08] yuvipanda, ori was going to provide code review [00:00:24] hoo_: jzerebecki aude is https://gerrit.wikimedia.org/r/#/c/208397/ still needed? [00:00:26] yuvipanda: sure! [00:00:36] AndyRussG: thanky ou [00:00:42] AndyRussG: are you aware of PuppetSWAT? [00:00:54] yuvipanda: is it OK if I do it like, tomorrow or Friday? [00:00:58] yuvipanda: no, what's that? [00:01:07] AndyRussG: wikitech.wikimedia.org/wiki/PuppetSWAT [00:01:13] AndyRussG: like SWAT but for puppet patches [00:01:16] yuvipanda: Didn't receive any recent reports... so presumably not [00:01:29] hoo_: can you abandon? we can then bring it back if needed [00:01:34] yuvipanda: gotcha :) nice name ;p [00:01:47] AndyRussG: so yeah, if you can do this this week you can put it up for next puppet swat on tuesday [00:01:52] cool [00:01:55] or you can also just ping me for this particular instance and I can merge it [00:02:00] PuppetSWAT can set more than wikis alight [00:02:01] (03PS1) 10EBernhardson: Refactor monolog handling to point to 1-N sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) [00:02:03] (03PS1) 10EBernhardson: Maintain existing api.log format when adding context [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240616 (https://phabricator.wikimedia.org/T108618) [00:02:04] SEAT [00:02:05] (03PS1) 10EBernhardson: Send the api request log to kafka [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240617 (https://phabricator.wikimedia.org/T108618) [00:02:05] If you perform a PuppetSWAT are you a SWATPuppeteer? [00:02:43] (03CR) 10jenkins-bot: [V: 04-1] Maintain existing api.log format when adding context [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240616 (https://phabricator.wikimedia.org/T108618) (owner: 10EBernhardson) [00:02:48] (03Abandoned) 10Hoo man: Add a dedicated Wikibase job runner [puppet] - 10https://gerrit.wikimedia.org/r/208397 (owner: 10Hoo man) [00:02:53] (03PS1) 10BBlack: bugfix to 0010-varnishd-no-size-reduce [debs/varnish] (3.0.6-plus-wm) - 10https://gerrit.wikimedia.org/r/240618 [00:02:55] (03PS1) 10BBlack: minor bugfix to 0010-varnishd-file-fallocate [debs/varnish] (3.0.6-plus-wm) - 10https://gerrit.wikimedia.org/r/240619 [00:02:57] (03PS1) 10BBlack: Add libvmod-tbf as debian patches (+control update for libdb-dev) [debs/varnish] (3.0.6-plus-wm) - 10https://gerrit.wikimedia.org/r/240620 [00:02:58] AndyRussG: I've always thought of it as https://en.wikipedia.org/wiki/SWAT_Kats:_The_Radical_Squadron [00:02:59] (03PS1) 10BBlack: Add libvmod-ipcast as debian patches [debs/varnish] (3.0.6-plus-wm) - 10https://gerrit.wikimedia.org/r/240621 [00:03:01] (03PS1) 10BBlack: varnish (3.0.6plus-wm8) jessie-wikimedia; urgency=low [debs/varnish] (3.0.6-plus-wm) - 10https://gerrit.wikimedia.org/r/240622 [00:03:03] (03CR) 10jenkins-bot: [V: 04-1] Send the api request log to kafka [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240617 (https://phabricator.wikimedia.org/T108618) (owner: 10EBernhardson) [00:03:07] hoo_: thanks [00:03:43] (03CR) 10Yuvipanda: "Can someone redo this and update commit message to provide more info? Could also be a PuppetSWAT candidate." 
[puppet] - 10https://gerrit.wikimedia.org/r/204996 (owner: 10Legoktm) [00:03:49] ohhhh hmmm /me debates whether he should show this to his kids [00:04:58] (03PS7) 10Dzahn: squid: logrotate for webproxy on carbon [puppet] - 10https://gerrit.wikimedia.org/r/239009 (https://phabricator.wikimedia.org/T97119) [00:05:01] AndyRussG: I enjoyed it immensely not too long ago [00:05:15] They do like cats! [00:05:45] (03Abandoned) 10Yuvipanda: Set thumbnails varnish TTL to 90 days [puppet] - 10https://gerrit.wikimedia.org/r/218858 (https://phabricator.wikimedia.org/T77697) (owner: 10Gilles) [00:06:42] (03Abandoned) 10Yuvipanda: Set group for /srv/mediawiki on singlenode mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/79955 (https://bugzilla.wikimedia.org/72046) (owner: 10Mattflaschen) [00:07:35] (03CR) 10BBlack: [C: 032] bugfix to 0010-varnishd-no-size-reduce [debs/varnish] (3.0.6-plus-wm) - 10https://gerrit.wikimedia.org/r/240618 (owner: 10BBlack) [00:07:44] (03CR) 10BBlack: [C: 032] minor bugfix to 0010-varnishd-file-fallocate [debs/varnish] (3.0.6-plus-wm) - 10https://gerrit.wikimedia.org/r/240619 (owner: 10BBlack) [00:07:47] hoo_: https://gerrit.wikimedia.org/r/#/c/113755/? [00:07:53] (03CR) 10BBlack: [C: 032] Add libvmod-tbf as debian patches (+control update for libdb-dev) [debs/varnish] (3.0.6-plus-wm) - 10https://gerrit.wikimedia.org/r/240620 (owner: 10BBlack) [00:07:56] yuvipanda, ori was going to provide code review [00:07:59] which patch? [00:08:01] (03CR) 10BBlack: [C: 032] Add libvmod-ipcast as debian patches [debs/varnish] (3.0.6-plus-wm) - 10https://gerrit.wikimedia.org/r/240621 (owner: 10BBlack) [00:08:09] (03CR) 10BBlack: [C: 032] varnish (3.0.6plus-wm8) jessie-wikimedia; urgency=low [debs/varnish] (3.0.6-plus-wm) - 10https://gerrit.wikimedia.org/r/240622 (owner: 10BBlack) [00:08:25] !log varnish package on carbon for jessie updated to 3.0.6plus-wm8 [00:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:08:31] ori, https://gerrit.wikimedia.org/r/#/c/236500/ [00:08:43] Krenair: i'll review now, sorry [00:08:44] Reedy: still want https://gerrit.wikimedia.org/r/#/c/231889/? [00:08:51] yuvipanda: I *planned* to work on that at some point, to make me remember it, I kept it open [00:09:03] should problably rewrite it in php/python/perl/… [00:09:14] hoo_: yes, my personal opinion is to write it in python [00:09:36] basically if you're using conditionals in bash and your script is more than 5 lines long then do not use bash :P [00:09:41] Yeah, it's standard library is very handy for that [00:09:48] :D [00:09:49] (03CR) 10Yuvipanda: "Is this still needed?" [puppet] - 10https://gerrit.wikimedia.org/r/231889 (https://phabricator.wikimedia.org/T109195) (owner: 10Reedy) [00:09:50] Krenair: have you tested it? (the python changes, specifically) [00:10:07] I might have at some point [00:10:13] hoo_: :D let's abandon it? [00:10:35] Krenair: ok, I'll merge and see if anything breaks; the stakes are low. 
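A concrete illustration of the bash rule of thumb above — this is not hoo_'s actual script, just a generic sketch of the defensive boilerplate (strict mode, careful argument forwarding) that conditional-heavy bash needs before it is safe to grow past a few lines:

```
#!/bin/bash
# Strict mode: abort on errors, on unset variables and on pipeline failures.
set -euo pipefail

# Forward arguments with quoting intact; bare $@ or $* is the classic way
# option arguments get mangled before they reach the wrapped command.
if [ "$#" -lt 1 ]; then
    echo "usage: $0 <command> [args...]" >&2
    exit 1
fi

exec "$@"
```

Once the logic needs real branching, argument parsing or error reporting, the Python standard library covers the same ground with far fewer sharp edges — which is the rewrite being suggested here.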
[00:10:56] (03CR) 10Dzahn: "yes, the bug is still open and needs Physikerwelt" [puppet] - 10https://gerrit.wikimedia.org/r/231889 (https://phabricator.wikimedia.org/T109195) (owner: 10Reedy) [00:11:06] yuvipanda: Ok :P I'll pin a post it to my screen so that it's not totally impossible that I ever pick it up again [00:11:19] (03PS7) 10Ori.livneh: tcpircbot: Also take input from files [puppet] - 10https://gerrit.wikimedia.org/r/236500 (owner: 10Alex Monk) [00:11:28] hoo_: P [00:12:03] (03CR) 10Ori.livneh: [C: 032] tcpircbot: Also take input from files [puppet] - 10https://gerrit.wikimedia.org/r/236500 (owner: 10Alex Monk) [00:12:44] (03Abandoned) 10Hoo man: toollabs/sql: Fix argument forwarding (-v breaks mysql) and clean up [puppet] - 10https://gerrit.wikimedia.org/r/113755 (owner: 10Hoo man) [00:13:38] (03CR) 10Dzahn: "actually... ok. i added Physikerwelt here." [puppet] - 10https://gerrit.wikimedia.org/r/231889 (https://phabricator.wikimedia.org/T109195) (owner: 10Reedy) [00:14:50] PROBLEM - logstash process on logstash1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash [00:15:16] ugh [00:15:21] mutante: ^ can you take a look? [00:15:27] * yuvipanda is preppint to send out toollabs survey [00:15:48] logstash? ok [00:16:15] (03Abandoned) 10Dzahn: Install texlive-extra-utils on mw appservers [puppet] - 10https://gerrit.wikimedia.org/r/231889 (https://phabricator.wikimedia.org/T109195) (owner: 10Reedy) [00:16:49] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [00:16:56] !log started logstash on logstash1002 [00:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:17:20] !log restarting tcpircbot on neon [00:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:18:19] Krenair: seems to work. [00:18:21] i don't know why logstash was stopped [00:18:38] mutante: ori found it crashed a few days ago as well [00:18:38] but it was simply not running and i could start it just fine [00:18:41] yeah [00:18:43] last time it was logstash1001 and it exceeded its own self-imposed memory limit [00:18:49] RECOVERY - logstash process on logstash1002 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash [00:19:14] ok [00:19:21] !log restarted replication on db1051 [00:19:25] orly [00:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:19:32] ok [00:20:42] "Java 7 support is only best effort, it may not work." [00:20:46] in err.log :p [00:21:05] (03PS2) 10EBernhardson: Send the api request log to kafka [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240617 (https://phabricator.wikimedia.org/T108618) [00:21:08] "It will be removed in next release (1.0)." [00:21:40] oh [00:21:48] do they expect you to use java8? [00:22:19] (03Abandoned) 10Yuvipanda: [WIP] [tools/apt] Multi-distro packaging [puppet] - 10https://gerrit.wikimedia.org/r/221456 (owner: 10Merlijn van Deen) [00:23:57] 6operations, 10Math, 5Patch-For-Review: Install texlive-extra-utils on mw appservers - https://phabricator.wikimedia.org/T109195#1669103 (10Dzahn) a:3Physikerwelt @Physikerwelt See our comments above, texlive-latex-extra is already installed. Could it be about the whitelisting in the extension? Just assign... [00:24:20] yuvipanda: yes [00:24:54] WARN -- Concurrent: [DEPRECATED] Java 7 is deprecated, please use Java 8. 
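For reference, the manual intervention above (the `logstash process` Icinga check simply counts java processes owned by UID 998) boils down to something like the following; the error-log path is an assumption, the init-script commands are the ones quoted later in this log:

```
# Is a logstash java process running under the logstash user (UID 998)?
pgrep -u logstash -f 'java.*logstash' || echo "logstash is not running"

# Start it via the init script and re-check, as was done on logstash1002.
sudo /etc/init.d/logstash start
sudo /etc/init.d/logstash status

# The error log is the first place to look for why it died (path may differ).
sudo tail -n 50 /var/log/logstash/logstash.err
```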
[00:25:07] ouch [00:25:09] (03PS7) 10Yuvipanda: mediawiki: Add test to verify redirects.conf has been regenerated from redirects.dat [puppet] - 10https://gerrit.wikimedia.org/r/204994 (https://phabricator.wikimedia.org/T72068) (owner: 10Legoktm) [00:25:38] and you took that one too :) i was about to [00:25:47] i wonder if there is still a diff now if you regenerate it [00:27:03] (03PS4) 10Ricordisamoa: Don't match Phabricator task IDs inside URLs [puppet] - 10https://gerrit.wikimedia.org/r/226234 (https://phabricator.wikimedia.org/T75997) [00:27:17] oh, nevermind, different thing [00:28:02] (03CR) 10Ricordisamoa: Don't match Phabricator task IDs inside URLs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/226234 (https://phabricator.wikimedia.org/T75997) (owner: 10Ricordisamoa) [00:28:15] (03PS5) 10Ricordisamoa: Don't match Phabricator task IDs inside URLs [puppet] - 10https://gerrit.wikimedia.org/r/226234 (https://phabricator.wikimedia.org/T75997) [00:30:59] (03PS2) 10BBlack: improve XFF/XFP/XRIP code in common VCL [puppet] - 10https://gerrit.wikimedia.org/r/240582 [00:33:03] (03PS3) 10BBlack: improve XFF/XFP/XRIP code in common VCL [puppet] - 10https://gerrit.wikimedia.org/r/240582 [00:33:34] (03CR) 10Yuvipanda: [C: 032] "I think the generated file should only be generated, so if someone has local uncommited changes to that file... they shouldn't :D" [puppet] - 10https://gerrit.wikimedia.org/r/204994 (https://phabricator.wikimedia.org/T72068) (owner: 10Legoktm) [00:34:50] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Jenkins: Re-enable lint checks for Apache config in operations-puppet - https://phabricator.wikimedia.org/T72068#1669147 (10yuvipanda) Good Job Lego (As he asked me to type) - the redirects are checked now. There is lots of other apache... [00:38:25] (03CR) 10Dzahn: move misc/labsdebrepo out of misc to module (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn) [00:40:44] (03PS10) 10Dzahn: move misc/labsdebrepo out of misc to module [puppet] - 10https://gerrit.wikimedia.org/r/194796 [00:41:33] (03CR) 10jenkins-bot: [V: 04-1] move misc/labsdebrepo out of misc to module [puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn) [00:43:29] (03PS11) 10Dzahn: move misc/labsdebrepo out of misc to module [puppet] - 10https://gerrit.wikimedia.org/r/194796 [00:47:01] (03Abandoned) 10EBernhardson: Maintain existing api.log format when adding context [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240616 (https://phabricator.wikimedia.org/T108618) (owner: 10EBernhardson) [00:47:26] (03CR) 10Dzahn: "addressed comments from Faidon, rebased, noted how it's not used in toollabs module init.pp anymore." 
[puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn) [00:47:50] yuvipanda: https://gerrit.wikimedia.org/r/#/c/194796/ ?:) [00:49:15] i would run it in compiler but i'm not sure what to run it on [00:50:00] (03PS12) 10Dzahn: move misc/labsdebrepo out of misc to module [puppet] - 10https://gerrit.wikimedia.org/r/194796 [00:50:35] (03PS3) 10Dzahn: annualreport: puppetize git cloning [puppet] - 10https://gerrit.wikimedia.org/r/240606 [00:51:35] mutante: don't think it'll work [00:53:20] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [00:55:10] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10501 bytes in 1.125 second response time [00:55:18] arghhhhhhhhhhhhhhh [00:55:26] bblack: ^ yay more pages [00:57:55] it's always the same pattern.. again [00:58:44] around the same time, for the same amount of time, and always just IPv6 combined with mobile [00:59:56] (03CR) 10Ori.livneh: improve XFF/XFP/XRIP code in common VCL (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/240582 (owner: 10BBlack) [01:01:07] (03PS1) 10Dzahn: broken redirects.conf to test jenkins [puppet] - 10https://gerrit.wikimedia.org/r/240626 [01:01:59] (03CR) 10jenkins-bot: [V: 04-1] broken redirects.conf to test jenkins [puppet] - 10https://gerrit.wikimedia.org/r/240626 (owner: 10Dzahn) [01:05:25] (03PS2) 10Dzahn: broken redirects.conf to test jenkins [puppet] - 10https://gerrit.wikimedia.org/r/240626 [01:05:40] (03PS3) 10Dzahn: broken redirects.dat to test jenkins [puppet] - 10https://gerrit.wikimedia.org/r/240626 [01:06:24] (03CR) 10jenkins-bot: [V: 04-1] broken redirects.dat to test jenkins [puppet] - 10https://gerrit.wikimedia.org/r/240626 (owner: 10Dzahn) [01:07:59] (03PS4) 10Dzahn: broken redirects.conf to test jenkins [puppet] - 10https://gerrit.wikimedia.org/r/240626 [01:09:07] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Jenkins: Re-enable lint checks for Apache config in operations-puppet - https://phabricator.wikimedia.org/T72068#1669220 (10Dzahn) tested it. just changed redirects.conf, but not .dat https://gerrit.wikimedia.org/r/#/c/240626/1 Assert... [01:11:13] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Jenkins: Re-enable lint checks for Apache config in operations-puppet - https://phabricator.wikimedia.org/T72068#1669223 (10Dzahn) 5stalled>3Resolved a:3Dzahn thank you @Legoktm @Yuvipanda maybe there should be more checks, yea.... [01:11:35] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Jenkins: Re-enable lint checks for Apache config in operations-puppet - https://phabricator.wikimedia.org/T72068#1669227 (10Dzahn) a:5Dzahn>3Legoktm [01:12:17] (03Abandoned) 10Dzahn: broken redirects.conf to test jenkins [puppet] - 10https://gerrit.wikimedia.org/r/240626 (owner: 10Dzahn) [01:15:30] (03CR) 10Dzahn: "This difference doesn't exist anymore. I just tried it again to make sure. 
"no changes added to commit" when running the script and adding" [puppet] - 10https://gerrit.wikimedia.org/r/204996 (owner: 10Legoktm) [01:15:45] (03Abandoned) 10Dzahn: mediawiki: Update redirects.conf using refreshDomainRedirects [puppet] - 10https://gerrit.wikimedia.org/r/204996 (owner: 10Legoktm) [01:15:53] !log ori@tin Synchronized php-1.26wmf24/includes: Ifa0d4cfe8e3: Backport I1ff61153d and I8e4c3d5a5 (duration: 00m 23s) [01:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:18:55] 6operations, 5Patch-For-Review: Backup and decom home_pmtpa - https://phabricator.wikimedia.org/T113265#1669251 (10Dzahn) backup done: ``` The defined FileSet resources are: 1: home 2: home_pmtpa Select FileSet resource (1-2): 2 +--------+-------+-----------+-------------------+--------------------... [01:19:40] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [01:23:29] (03PS4) 10Dzahn: Remove home_pmtpa and svn client from bast1001 [puppet] - 10https://gerrit.wikimedia.org/r/231142 (https://phabricator.wikimedia.org/T113265) (owner: 10Faidon Liambotis) [01:25:04] (03PS5) 10Dzahn: Remove home_pmtpa and svn client from bast1001 [puppet] - 10https://gerrit.wikimedia.org/r/231142 (https://phabricator.wikimedia.org/T113265) (owner: 10Faidon Liambotis) [01:26:08] (03CR) 10Dzahn: [C: 032] "the was no reply on engineering, only +1's and the backup exists on bacula now" [puppet] - 10https://gerrit.wikimedia.org/r/231142 (https://phabricator.wikimedia.org/T113265) (owner: 10Faidon Liambotis) [01:27:57] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/231142 (https://phabricator.wikimedia.org/T113265) (owner: 10Faidon Liambotis) [01:28:41] (03PS2) 10Dzahn: Remove the subversion module [puppet] - 10https://gerrit.wikimedia.org/r/239126 (owner: 10Faidon Liambotis) [01:29:00] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:30:39] (03CR) 10Dzahn: [C: 032] Remove the subversion module [puppet] - 10https://gerrit.wikimedia.org/r/239126 (owner: 10Faidon Liambotis) [01:34:28] !log removing subversion packages from bast1001 [01:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:35:07] (03CR) 10Dzahn: "re: manual clean up:" [puppet] - 10https://gerrit.wikimedia.org/r/231142 (https://phabricator.wikimedia.org/T113265) (owner: 10Faidon Liambotis) [01:35:24] !log bast1001: unmounting /srv/home_pmtpa (backup on bacula) [01:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:36:30] 6operations, 5Patch-For-Review: Backup and decom home_pmtpa - https://phabricator.wikimedia.org/T113265#1669315 (10Dzahn) re: manual clean up: bast1001: ``` apt-get remove --purge subversion The following packages will be REMOVED: subversion* ``` ``` apt-get autoremove The following packages will be REM... [01:36:38] 6operations, 7Performance, 5Release-Engineering-Epics: [EPIC] Performance testing environment - https://phabricator.wikimedia.org/T67394#1669316 (10greg) [01:37:50] PROBLEM - logstash process on logstash1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash [01:37:59] 6operations, 5Patch-For-Review: Backup and decom home_pmtpa - https://phabricator.wikimedia.org/T113265#1669321 (10Dzahn) was this all or should anything be done on nas1001-a.eqiad? If so, please do. 
[01:38:20] 6operations: Backup and decom home_pmtpa - https://phabricator.wikimedia.org/T113265#1669322 (10Dzahn) [01:46:58] 7Blocked-on-Operations, 7Puppet, 6Reading-Infrastructure-Team, 10Sentry, and 2 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1669334 (10greg) [01:48:46] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Decommission sodium - https://phabricator.wikimedia.org/T110142#1669345 (10Dzahn) eh, @Faidon to get that one last backup i guess i have to re-add it to puppet for a moment, right :p [01:48:51] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp3015_v6 [01:49:30] 6operations, 6Release-Engineering-Team, 10Wikimedia-Apache-configuration: Make it possible to quickly and programmatically pool and depool application servers - https://phabricator.wikimedia.org/T73212#1669349 (10greg) [01:49:32] 6operations, 10Deployment-Systems, 6Performance-Team, 6Release-Engineering-Team, 7HHVM: Make scap able to depool/repool servers via the conftool API - https://phabricator.wikimedia.org/T104352#1669348 (10greg) [01:49:42] 6operations, 10Deployment-Systems, 6Performance-Team, 7HHVM: Make scap able to depool/repool servers via the conftool API - https://phabricator.wikimedia.org/T104352#1414314 (10greg) [01:50:41] RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK [01:51:45] mutante: Looks like logstash didn't come back up properly. https://logstash.wikimedia.org/#/dashboard/elasticsearch/slow-parse and other logs are all idle. It stopped at exactly midnight UTC and not a single message has come through since hten [01:52:01] 6operations, 10Gitblit: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#1669357 (10greg) [01:52:12] Krinkle: all i did was "start" and the status said running [01:52:50] eh, i just started it again [01:52:56] @logstash1002:~# /etc/init.d/logstash status [01:52:56] logstash is not running [01:53:02] /etc/init.d/logstash start [01:53:02] logstash started. [01:53:23] !log started logstash on logstash1002 again [01:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:53:32] logstash is running [01:54:30] RECOVERY - logstash process on logstash1002 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash [01:56:07] mutante: I think there's a bigger problem somewhere. It stopped at exactly 23:59:999 2 hours ago. [01:56:19] And neither of the restart resulted in even a single message coming through [01:58:06] (03CR) 10QChris: [C: 04-1] "CR-1 per my comments on PS1, which still apply." [puppet] - 10https://gerrit.wikimedia.org/r/226234 (https://phabricator.wikimedia.org/T75997) (owner: 10Ricordisamoa) [02:07:30] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3048_v6 [02:09:29] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [02:09:52] (03CR) 10coren: [C: 031] "+1 for the semantics, but I'm not comfortable with the ruby/puppet side of things." [puppet] - 10https://gerrit.wikimedia.org/r/230477 (https://phabricator.wikimedia.org/T95747) (owner: 10Tim Landscheidt) [02:10:09] (03PS3) 10coren: Tools: Remove gridengine aliases for some hosts [puppet] - 10https://gerrit.wikimedia.org/r/235157 (https://phabricator.wikimedia.org/T109485) (owner: 10Tim Landscheidt) [02:11:37] (03CR) 10coren: [C: 032] "Yep." 
[puppet] - 10https://gerrit.wikimedia.org/r/235157 (https://phabricator.wikimedia.org/T109485) (owner: 10Tim Landscheidt) [02:14:35] !log Partial EventLogging outage (client-side events via hafnium abruptly stopped 2015-09-23 11:36 UTC - 15 hours ago) [02:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:16:02] 6operations, 10MediaWiki-extensions-ZeroPortal, 10Traffic, 6Zero: zerofetcher in production is getting throttled for API logins - https://phabricator.wikimedia.org/T111045#1592889 (10Yurik) Do we know what is actually doing the rate check and blocking? Is that a backend feature / an extension? [02:16:33] !log Kibana/Logstash outage. Zero events received after 2015-09-23T23:59:59.999Z. [02:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:20:04] sad! [02:23:01] !log kill gmond on hafnium and disable puppet to prevent it from taking it back up. Was taking 100% CPU [02:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:24:06] Krinkle: I'm going to look at logstash maybe? [02:25:59] I find {:timestamp=>"2015-09-24T02:25:24.664000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn} [02:26:03] in logstash err [02:26:21] Java 7 support is only best effort, it may not work. It will be removed in next release (1.0). [02:26:23] jeez [02:27:18] yuvipanda: I gotta go, I'll do a brief write up on ops-l and leave it to you and others to figure out. I'm just observing it. And there isn't much I can do about it unfortuantely. [02:27:19] Thanks :) [02:27:30] Krinkle: logstash or eventlogging? [02:27:34] both [02:27:35] ok [02:27:41] EL is definitely priority I think [02:27:56] I'm going to page otto soon [02:28:08] ok [02:34:38] Yoo yuvipanda what's happenin? [02:34:55] ottomata: hi [02:35:07] ottomata: https://grafana.wikimedia.org/#/dashboard/db/eventloggingschema [02:35:22] ottomata: Krinkle reported EL dead for 15h, I found gmond on hafnium taking up 100% CPU, killed it but still dead [02:35:46] ok, i think eventlogging is not dead, this is a hafnium problem [02:35:59] I'm not really sure how things fit into what and what does what [02:36:22] there is a bug... [02:36:23] ottomata: eventlogging.schema.* is not populated from hafnium though. [02:36:23] don't really see hafnium on https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/Architecture [02:36:27] this happened to ori before [02:36:30] that's pushed to statsd from EL directly [02:36:39] and that metric is dead [02:36:39] hmmmmm, no it is on hafnium [02:36:41] pretty sure [02:36:44] checking... [02:36:56] Ah could be in one of hte other consumers then [02:37:00] I gotta go, see ops-l. Thanks! [02:37:02] hmmm [02:37:02] I've disabled puppet on hafnium [02:37:12] should I re-enable it and run it and see what it brings up other than gmond? 
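What "kill gmond ... and disable puppet to prevent it from taking it back up" looks like in practice — a minimal sketch assuming standard Ganglia and Puppet packaging; the exact commands run on hafnium are not shown in the log:

```
# Confirm gmond is the process pegging a core.
top -b -n1 | head -n 15

# Stop it, then keep puppet from restarting it on the next agent run.
sudo pkill gmond
sudo puppet agent --disable "gmond at 100% CPU on hafnium, investigating"

# When done, re-enable and run puppet to restore the managed state.
sudo puppet agent --enable
sudo puppet agent --test
```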
[02:38:22] yeah i dunno what's up with gmond there [02:38:26] this happened a few days ago too [02:38:31] i think ori responded and did somehting [02:38:33] can't find the bug right now [02:38:36] he killed it [02:38:40] and disabled puppet for a day [02:38:46] !log l10nupdate@tin Synchronized php-1.26wmf23/cache/l10n: l10nupdate for 1.26wmf23 (duration: 06m 30s) [02:38:50] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [500.0] [02:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:38:53] yuvipanda: Krinkle http://grafana.wikimedia.org/#/dashboard/db/eventlogging?panelId=6&fullscreen [02:39:04] overall.inserted.rate is pretty normal [02:39:12] which means that events are being inserted into mysql [02:39:35] I see [02:39:43] I'm going to re-enable puppet [02:39:48] and see what it brings up [02:39:49] ok [02:40:07] !log re-enabling and running puppet on hafnium to see what it's bringing up [02:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:40:21] i think something is weird with hafnium, for sure. not sure what. buUuUUUt I have some people over soOooOo I'm not going to investigate further right now, 'sok? [02:40:28] if it turns out htings are really not ok, text me again? [02:40:38] ottomata: I don't really know what is and is not ok :) [02:40:41] and Krinkle is gone too... [02:40:50] puppet didn't update anything [02:40:57] ottomata: if you think nothing's wrong I can probably go too [02:41:22] ottomata: https://grafana.wikimedia.org/#/dashboard/db/eventloggingschema [02:41:29] !log gmond at 100% again, killing it and stopping puppet again [02:41:31] dropped dead 15h ago, nothing since. [02:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:41:59] milimetric: might also be around, he responded on anothe rchannel [02:42:08] yeah, Krinkle that is some monitoring problem with hafnium, i'm not really sure where those metrics come from. i would get them from kafka now too, would be more reliable [02:42:18] I'm here [02:42:32] I'm logging into the master db now to select to make sure the metrics aren't tricking us [02:42:37] ok! [02:42:45] ja thanks milimetric [02:43:26] yuvipanda: milimetric [02:43:27] ^C1d [@eventlog1001:/home/otto] 130 $ sudo tail -f /var/log/upstart/eventlogging_consumer-mysql-m4-master.log | grep inserted [02:43:36] ... [02:43:36] 2015-09-24 02:43:29,586 (Thread-15 ) Data inserted 1 [02:43:37] 2015-09-24 02:43:30,057 (Thread-15 ) Data inserted 66 [02:43:37] 2015-09-24 02:43:30,159 (Thread-15 ) Data inserted 24 [02:43:37] ... [02:45:12] does it mean it's working? [02:45:27] according to the mysql consumer, it is inserting events just fine [02:45:34] http://grafana.wikimedia.org/#/dashboard/db/eventlogging [02:45:36] http://i3.kym-cdn.com/photos/images/newsfeed/000/234/719/c7c.jpg [02:45:42] looks pretty normal [02:45:50] haha [02:46:04] ok, so can you or someone respond on ops@ list saying there seems to be no actual data loss? 
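The check ottomata pastes below can also be turned into a quick rate estimate; a sketch using the same consumer log on eventlog1001, with the per-minute tally being my addition for comparing against the overall.inserted.rate graph in Grafana:

```
# Watch raw insert batches as they happen.
sudo tail -f /var/log/upstart/eventlogging_consumer-mysql-m4-master.log | grep inserted

# Rough inserts-per-minute over the most recent chunk of the log.
sudo grep 'Data inserted' /var/log/upstart/eventlogging_consumer-mysql-m4-master.log \
  | tail -n 2000 \
  | awk '{min=substr($2,1,5); n[min]+=$NF} END {for (m in n) print m, n[m]}' \
  | sort
```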
[02:46:10] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [02:46:14] and then I can continue grocery shopping :) [02:46:47] or looking at logstash [02:49:12] * ori looks at hafnium [02:50:50] thanks ori [02:51:14] !log restarted logstash on logstash1002 [02:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:51:40] yuvipanda: am responding [02:51:58] wtf are these machines? http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [02:52:04] like 10.68.17.157 [02:53:13] i am signing off, if things blow up forrealzies do text, thanks yuvipanda! [02:53:46] I guess some part of eventlogging that monitors in ZMQ got deployed out of schedule and thus broke those properties [02:53:58] I see commits that remove that reporting [02:53:59] Fat fingered the power button and my laptop died [02:54:02] but that was days ago [02:54:10] http://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&h=10.68.17.157&m=cpu_report&s=by+name&mc=2&g=cpu_report&c=Miscellaneous+eqiad [02:54:23] seriously, wtf is that host? [02:55:37] 157.17.68.10.in-addr.arpa. 60 IN PTR deployment-zookeeper01.deployment-prep.eqiad.wmflabs. [02:55:39] ori: ^ [02:55:43] ori: deployment-zookeeper01 [02:55:44] no idea why that's coming through to ganglia [02:56:02] wow, wtf [02:56:11] Ugh, grafana doesn't support putting a template value inside a string. So kafka.eventlogging_{topic} and eventlogging.schema.{topic} can't be used at the same time, to e.g. compute error rates. [02:59:38] chasemp: around? am trying to debug ElasticSearch / logstash but no dice... [02:59:44] I see that traffic is coming in [03:00:09] yuvipanda / ori: I looked at the actual data on databases, and everything looks ok. Except a weird spike in events for a 10 minute period on the Edit schema. But that's the opposite of not enough events being inserted [03:00:48] so I'll sign off too. Feel free to text me if you need me. [03:01:17] milimetric: ok thanks [03:02:05] !log jstack dumped logstash output onto /home/yuvipanda/stack on logstash1001 since strace seems useles [03:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:02:59] hmm there's a bunch of threads blocked on sleep but also lots of threads actually reading [03:03:10] PROBLEM - LVS HTTPS IPv6 on upload-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [03:05:00] RECOVERY - LVS HTTPS IPv6 on upload-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 746 bytes in 3.037 second response time [03:05:50] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied [03:06:24] https://logstash.jira.com/browse/LOGSTASH-1382 might or might not be related [03:07:14] Krinkle: how did you find the last sent log message? 
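The way the mystery 10.68.17.157 host in the Ganglia list gets identified here is a plain reverse DNS lookup; either of these returns the PTR record quoted above:

```
# Reverse (PTR) lookup for the unknown IP seen in the Ganglia host list.
dig -x 10.68.17.157 +short
# or equivalently:
host 10.68.17.157
# -> deployment-zookeeper01.deployment-prep.eqiad.wmflabs.
```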
[03:07:25] yuvipanda: logstash, last 6 hours [03:07:42] top of the view is the 23:59:999 entry [03:08:53] I see [03:10:03] I've no idea what's going on unfortunately [03:12:54] ok, so I'm getting messages in [03:12:56] so that's established [03:13:47] and it's even sending data into the local elasticsearch [03:14:13] no [03:14:20] it's only apifeatureusage [03:14:21] hmm [03:16:26] okkk [03:16:31] it's only hitting localhost with requests to / [03:16:40] ok forget I said that [03:16:52] a big bunch of data just went paste [03:18:51] none of them seem to have made it however [03:19:01] I've no idea whom to page [03:19:12] logstash is in some ways abandonware eh [03:21:53] I should really get off this park bench and stop talking to myself [03:22:52] anyone know how to check the apifeature thing? [03:26:10] RECOVERY - Disk space on labstore1002 is OK: DISK OK [03:26:38] what about it? [03:27:01] TimStarling: I was tcpdumping and saw some of those go through to the main elasticsearch cluster [03:27:06] TimStarling: I wanna see if it's still updating fine [03:27:12] to try to see if the problem is with logstash or elasticsearch [03:28:18] ebernhardson: around? am trying to figure out if Elasticsearch is fucked in logstash [03:30:09] PROBLEM - logstash process on logstash1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash [03:31:57] I don't know anything about logstash but it's getting a bit late for SF so maybe I'd better try and help out [03:32:00] RECOVERY - logstash process on logstash1001 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash [03:32:39] is that you restarting logstash, yuvipanda? [03:33:03] TimStarling: it died and I started it again [03:33:10] ok, so I think the logstash ES cluster is sick [03:33:18] I see it sending events to the main ES cluster [03:33:27] where the apifeature stuff lives [03:34:17] with tcpdump -A -i lo [03:34:23] you see that it's just doing health checks to local ES [03:34:25] and nothing else [03:35:46] hmm it's stopped sending those too now [03:37:52] so elasticsearch hasn't been restarted yet [03:38:05] no [03:38:17] not sure if that should be co-ordinated across the cluster or not [03:39:51] hmm I don't see apifeature packets anymore either [03:40:35] aha [03:40:42] so logstash1003 ES logfile actually has errors [03:41:50] aaand I think the time of the exception corresponds to stopping of events [03:42:06] nope [03:42:07] it doesn't [03:42:09] nevermind [03:42:13] another red herring [03:43:22] haha! [03:43:30] [2015-09-24 03:42:16,822][INFO ][cluster.routing.allocation.decider] [logstash1004] low disk watermark [85%] exceeded on [rT_zbyrOSCGIMp-hArZ3eA][logstash1005] free: 348.4gb[12.5%], replicas will not be assigned to this node [03:43:31] mayybe [03:43:33] that's what's going on [03:43:42] yuppp [03:43:45] I was looking in wrong place [03:43:51] logstash1004-6 are the actual ES nodes [03:44:37] dcausse: around? 
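Two of the probes used in this stretch of debugging, written out; the port filter and the Elasticsearch log path are assumptions on top of the bare `tcpdump -A -i lo` and the allocation-decider log line quoted above:

```
# Watch what logstash is actually sending to the local Elasticsearch node.
sudo tcpdump -A -i lo port 9200

# On the data nodes, the allocation decider logs the watermark refusals,
# which is what eventually points at full disks (paths may vary per setup).
df -h /var/lib/elasticsearch
sudo grep -i 'disk watermark' /var/log/elasticsearch/*.log | tail
```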
[03:45:44] I see 'status': yellow [03:47:15] ok current theory is that logstash ES is basically full [03:48:42] (03PS1) 10Yuvipanda: elasticsearch: Increase low disk watermark to 90% of disk [puppet] - 10https://gerrit.wikimedia.org/r/240633 [03:48:43] TimStarling: ^ I'm going to try this [03:49:06] (03PS2) 10Yuvipanda: elasticsearch: Increase low disk watermark to 90% of disk [puppet] - 10https://gerrit.wikimedia.org/r/240633 [03:49:16] hmm [03:49:21] gotta be careful to not fuck prod elasticsearch [03:50:14] ok no notifys there [03:50:18] so it shouldn't restart by itself [03:50:34] (03CR) 10Yuvipanda: [C: 032 V: 032] "No notifys so shouldn't restart by itself." [puppet] - 10https://gerrit.wikimedia.org/r/240633 (owner: 10Yuvipanda) [03:51:54] ok [03:51:59] forcing a puppet run on all three hosts now [03:52:27] fair enough [03:52:37] TimStarling: yeah they're all at 86% [03:52:41] so if I bump it to 90% [03:52:44] and it comes back... [03:52:48] we know that's what it is [03:52:59] df says 87-88% usage [03:53:24] yeah [03:53:33] yeah, sorry that's what they were [03:53:47] !log restarted es on logstash1004-6 [03:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:54:40] ok that Might be working? [03:55:06] so it is just a question of expiry policy? I see indices going back to 2015-08-25 [03:55:14] what cleans them? [03:55:26] not sure [03:55:34] I saw a cronjob that was stuck for many hours and killed it [03:55:38] maybe that does? [03:55:51] every night -30day index is dropped [03:56:10] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 23 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 11, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 36, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 85, initializing_shards: 12, number_of_data_nodes: 3, [03:56:15] maybe we've had too much error traffic lately that they can't hold 30 days' worth [03:56:20] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 23 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 11, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 36, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 85, initializing_shards: 12, number_of_data_nodes: 3, [03:56:20] PROBLEM - ElasticSearch health check for shards on logstash1005 is CRITICAL: CRITICAL - elasticsearch inactive shards 23 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 11, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 36, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 85, initializing_shards: 12, number_of_data_nodes: 3, [03:56:20] PROBLEM - ElasticSearch health check for shards on logstash1006 is CRITICAL: CRITICAL - elasticsearch inactive shards 23 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 11, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 36, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 85, initializing_shards: 12, number_of_data_nodes: 3, [03:56:38] heh [03:57:00] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - 
elasticsearch inactive shards 23 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 11, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 36, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 85, initializing_shards: 12, number_of_data_nodes: 3, [03:57:00] PROBLEM - ElasticSearch health check for shards on logstash1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 23 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 11, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 36, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 85, initializing_shards: 12, number_of_data_nodes: 3, [03:57:14] https://github.com/wikimedia/operations-puppet/blob/f290715b9a8b9b880f313f7e42c53cee44a98cda/modules/logstash/manifests/output/elasticsearch.pp#L56 [03:57:44] perhaps we can manually drop a few more days to free up space if that's the issue [03:58:10] setting the watermark to 90% when we have 88% used seems like a temporary issue [03:58:13] temporary fix rather [03:58:23] yeah but I'm not sure if that's the actual issue... [03:58:26] since logs aren't back yet [03:58:28] sure [03:58:40] they came back and went out again [03:58:47] yuvipanda: A small amount of messages just got through in kibana yeah [03:58:56] first burst since 5 hours ago [03:58:59] yeah [03:59:11] !log restarting elasticsearch in logstash1001-3 [03:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:00:51] RECOVERY - Last backup of the maps filesystem on labstore1002 is OK: OK - Last run successful [04:01:39] PROBLEM - logstash process on logstash1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash [04:02:05] hmm [04:03:02] Sep 24, 2015 3:59:15 AM org.apache.http.impl.execchain.RetryExec execute [04:03:02] INFO: Retrying request to {}->http://127.0.0.1:9200 [04:03:02] Manticore::SocketException: Connection refused [04:03:17] says logstash1001 [04:03:21] PROBLEM - logstash process on logstash1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash [04:03:37] curl localhost:9200/_cat/nodes?v=5 [04:03:41] for cluster node and health [04:03:59] let me start logstash on these [04:04:04] and see if the round of restarts fixed it [04:04:10] PROBLEM - logstash process on logstash1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash [04:04:38] I guess logstash dies if you take down elasticsearch at all [04:04:46] yeah [04:05:09] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [100000000.0] [04:05:19] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-maps/snapshot is not accessible: Permission denied [04:05:19] RECOVERY - logstash process on logstash1001 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash [04:05:20] RECOVERY - logstash process on logstash1002 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash [04:05:25] {:timestamp=>"2015-09-24T04:05:12.780000+0000", :message=>"retrying failed action with response code: 503", :level=>:warn} [04:06:00] RECOVERY - logstash process on logstash1003 is OK: PROCS OK: 1 process with UID = 998 (logstash), 
command name java, args logstash [04:07:49] 6operations, 6Discovery, 6Security, 7Elasticsearch: Upgrade CirrusSearch's Elasticsearch cluster to 1.3.8+ - https://phabricator.wikimedia.org/T92853#1669546 (10Deskana) [04:08:06] 6operations, 6Discovery, 6Security, 7Elasticsearch: Upgrade CirrusSearch's Elasticsearch cluster to 1.3.8+ - https://phabricator.wikimedia.org/T92853#1122033 (10Deskana) [04:09:48] it might just be recovering [04:09:56] I see ES doing a lot of CPU [04:09:59] on the data nodes [04:10:20] with kswapd awake? [04:10:47] ugh yes [04:11:08] 1TB virtual memory usage presumably doesn't help performance [04:11:13] yeah... [04:11:16] but maybe that is normal [04:11:21] ES is supposed to cap itself to 30G according to Hiera. [04:11:29] but that's 'heap space' not sure what that is in terms of reality [04:11:40] resident is 29GB [04:11:56] it's just virtual memory that is bloated [04:11:59] ok this park is getting cold and homess people are moving in and I have no glasses on [04:12:09] I'll walk home and be back on laptop in about 30min [04:12:15] yeah, nobody is dying while this is down [04:12:18] yeah [04:12:22] brb [04:12:47] https://www.elastic.co/guide/en/elasticsearch/reference/1.4/cat-health.html has some useful stuff [04:13:21] (03PS2) 10EBernhardson: Refactor monolog handling to point to 1-N sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) [04:13:42] (03CR) 10jenkins-bot: [V: 04-1] Refactor monolog handling to point to 1-N sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) (owner: 10EBernhardson) [04:15:04] * ebernhardson grumbles about tests passing locally [04:17:12] if you look at the last hour's network for the logstash cluster, its gone from a MB/s to about 100 [04:17:50] and the available cluster memory is reported as much lower than before [04:17:53] if i had to guess, lost a node? [04:17:59] * ebernhardson can't login to those servers [04:19:23] hmm, allocation claims 3 data nodes are up, but there are 11 unassigned indices. its basically recovering from losing a node and having it come back [04:21:02] but it's been down since about 00:00 hasn't it? [04:21:26] ah yes, daily ganglia says CPU dropped at that time [04:21:39] the network graph spikes ~3:55 utc [04:22:05] that's when yuvipanda restarted it [04:22:15] ahh, ok so thats the network traffic [04:22:26] so you think the restart fixed it and now it will recover by itself? [04:22:39] well, i don't know. but i do know elastic doesn't like to recover from reboots [04:22:52] in the search cluster it takes an hour and a half to two hours to settle down [04:22:52] why can you not log in? [04:23:01] i have root in the search cluster, but not logstash [04:23:52] ebernhardson: yes I wonder if the reboots fucked it up more [04:24:29] ebernhardson: do you think it'll recover at all? [04:24:38] in theory, the way logstash uses es should be alot nicer than our primary, no deletes so it should handle reboots better [04:24:48] Also I probably should have waited a lot more before restarting each node... [04:24:50] yuvipanda: i dunno, the cat api's don't look that bad beyond recovering from a reboot [04:25:02] ahh its just losing a little logs, its already losing logs [04:25:09] i wouldn't worry about rebooting wrong :) [04:25:34] ebernhardson: yeah it was losing logs for a loong time [04:25:38] So I didn't feel too bad [04:25:57] yuvipanda: what about the actual logstash daemons? 
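The `_cat` APIs referenced just above are the quickest way to watch a cluster work through a restart; a few of the handy ones, runnable on any node of the logstash cluster (ES 1.x syntax, per the docs link above):

```
# Overall state: green/yellow/red plus shard counts.
curl -s 'localhost:9200/_cat/health?v'

# One line per node with heap, load and role.
curl -s 'localhost:9200/_cat/nodes?v'

# Shards still initializing or unassigned after the restart.
curl -s 'localhost:9200/_cat/shards?v' | grep -Ev ' STARTED '

# Disk used per node, which is what the watermark settings act on.
curl -s 'localhost:9200/_cat/allocation?v'
```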
[04:26:08] yuvipanda: that pipe the mediawiki logs into es [04:26:10] They died again when I restarted ea [04:26:10] nik has logstash root [04:26:18] I started them again [04:26:35] I think elastic roots should have logstash root too [04:26:38] if those are dieing, imo thats probably the source ofthe problem [04:26:48] (of logs not getting in) [04:27:12] i'll make a ticket and ask, it's not like someone said i cant have access [04:27:24] ebernhardson: the data nodes reported that they were over 85% full and that was over the low watermark [04:27:25] so add dcausse and tomasz? [04:27:46] (03PS1) 10Aaron Schulz: Removed ignore_user_abort( true ) line [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240634 [04:27:48] and ebernhardson of course [04:27:53] And no new shards are gonna be added. On all 3 data nodes [04:29:11] ebernhardson: I just made a patchset [04:30:28] (03PS1) 10Tim Starling: Give ElasticSearch roots access to the logstash boxes also [puppet] - 10https://gerrit.wikimedia.org/r/240635 [04:30:58] 10Ops-Access-Requests, 6operations: Give elasticsearch roots access to the logstash cluster as well. - https://phabricator.wikimedia.org/T113569#1669575 (10EBernhardson) 3NEW [04:31:28] (03PS2) 10EBernhardson: Give ElasticSearch roots access to the logstash boxes also [puppet] - 10https://gerrit.wikimedia.org/r/240635 (https://phabricator.wikimedia.org/T113569) (owner: 10Tim Starling) [04:31:37] ebernhardson: TimStarling no [04:31:47] just updated commit message with an ops-access-request phab ticket # [04:31:58] (03PS1) 10Yuvipanda: admin: Give ES roots root on the Logstash ES boxes [puppet] - 10https://gerrit.wikimedia.org/r/240636 [04:32:03] ebernhardson: TimStarling ^ I guess [04:32:07] is nicer [04:32:11] suit yourself [04:32:13] (popped into a burger joint for food) [04:32:14] ok [04:32:16] I'll just merge [04:32:28] fine by me [04:32:41] (03PS2) 10Yuvipanda: admin: Give ES roots root on the Logstash ES boxes [puppet] - 10https://gerrit.wikimedia.org/r/240636 [04:32:46] less duplication [04:32:58] (03PS3) 10Yuvipanda: admin: Give ES roots root on the Logstash ES boxes [puppet] - 10https://gerrit.wikimedia.org/r/240636 (https://phabricator.wikimedia.org/T113569) [04:33:17] (03CR) 10Yuvipanda: [C: 032 V: 032] admin: Give ES roots root on the Logstash ES boxes [puppet] - 10https://gerrit.wikimedia.org/r/240636 (https://phabricator.wikimedia.org/T113569) (owner: 10Yuvipanda) [04:33:31] ok [04:33:40] I have to actually eat now [04:34:02] but I'll force a puppet run first [04:34:10] I can do it if you want [04:34:49] TimStarling: oh, already started... [04:34:52] but I'm off keyboard now. [04:36:35] ebernhardson: you should have access now [04:41:07] denied [04:41:43] to which server? [04:41:44] yuvipanda: interesting, 4 let me in, 1-3 gave denied [04:41:57] logstash100[123] [04:42:04] oh right. those are the data nodes [04:42:25] logstash role vs logstash::elasticsearch [04:44:26] we could just merge the logstash admin and elastic admin groups into a single group [04:44:35] it would make a lot of sense [04:45:16] probably [04:45:41] TimStarling: or ori can you make a patch givung ebernhardson access to the other 3? [04:45:51] ok [04:45:56] thanks [04:47:35] mark explicitly asked me not to use root to grant cluster privileges [04:47:40] so i better not [04:47:52] ok [04:48:18] but tim can [04:48:43] (03CR) 10Glaisher: "That just addresses only part of my comment in PS5." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/237330 (https://phabricator.wikimedia.org/T104797) (owner: 10Mdann52) [04:48:52] i'm happy to merge :) [04:53:26] (03PS1) 10Tim Starling: Give elasticsearch-roots access to hosts in the logstash role [puppet] - 10https://gerrit.wikimedia.org/r/240637 [04:53:29] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [04:53:37] (03PS3) 10EBernhardson: Refactor monolog handling to point to 1-N sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) [04:53:57] (03CR) 10jenkins-bot: [V: 04-1] Refactor monolog handling to point to 1-N sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) (owner: 10EBernhardson) [04:54:03] I don't have a high degree of familiarity with these puppet roles but I think that is the right place to patch [04:54:24] (03CR) 10Yuvipanda: [C: 031] Give elasticsearch-roots access to hosts in the logstash role [puppet] - 10https://gerrit.wikimedia.org/r/240637 (owner: 10Tim Starling) [04:54:28] yup [04:54:36] can you merge and force a run? [04:54:39] yes [04:54:45] tx [04:54:47] yes, that's correct [04:54:52] (03CR) 10Tim Starling: [C: 032] Give elasticsearch-roots access to hosts in the logstash role [puppet] - 10https://gerrit.wikimedia.org/r/240637 (owner: 10Tim Starling) [04:55:08] 6operations, 10Sentry, 10hardware-requests: Procure hardware for Sentry - https://phabricator.wikimedia.org/T93138#1669614 (10Tgr) [04:55:15] 7Blocked-on-Operations, 7Puppet, 6Reading-Infrastructure-Team, 10Sentry, and 2 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1669610 (10Tgr) 5Open>3Resolved a:3Tgr [04:58:18] ok, should be done now ebernhardson [05:00:19] TimStarling: looks to have worked, thanks [05:09:48] (03CR) 1020after4: "no conflict. This needs to be done." [puppet] - 10https://gerrit.wikimedia.org/r/226573 (owner: 10Negative24) [05:10:13] (03CR) 1020after4: [C: 031] phab: Add passwd entries for vcs user [puppet] - 10https://gerrit.wikimedia.org/r/226573 (owner: 10Negative24) [05:11:47] ebernhardson: is there a way to check progress of the rebuild? [05:12:00] * yuvipanda wrestles with laptop to try get it to start [05:12:33] (03CR) 1020after4: [C: 031] SSH repo hosting support for phabricator. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/237096 (https://phabricator.wikimedia.org/T128) (owner: 1020after4) [05:12:46] (03PS5) 1020after4: SSH repo hosting support for phabricator. [puppet] - 10https://gerrit.wikimedia.org/r/237096 (https://phabricator.wikimedia.org/T128) [05:13:01] i see events trickling into the parsoid kibana dashboard since the last 30 mins or so ... not sure if the volume is right, but I see stuff there. [05:14:36] the logstash es cluster looks to be inserting ~800 docs/s [05:15:54] and cpu in ganglia went back up about 25 minutes ago, so its looks to be recovered? but i don't know why it was down or why it came back :( [05:16:08] ebernhardson: I increased low watermark from 85 to 90 [05:16:27] yuvipanda: oh, we should just deleted some of the oldest indices then until someone has a better solution [05:16:49] i just don't know if kibana cares how thats done... [05:17:03] there's a cron that does that every night apparently [05:17:12] yeah I see new events coming in [05:17:26] so we lost what, 5h of events? [05:19:14] something like that. 
another option i suppose would be to reduce the replica level after a certain age, currently this keeps a full copy of all indices on all nodes [05:19:28] i have no clue how much people query those thoug [05:20:19] subbu: there seems to be a ton of ERROR messages from restbase [05:24:14] ebernhardson: how do we drop a few days of the earliest logs? [05:25:59] PROBLEM - puppet last run on db2011 is CRITICAL: CRITICAL: puppet fail [05:26:18] yuvipanda: `localhost:9200/_cat/indices | sort` will give all the indices in order, then `curl -XDELETE localhost:9200/logstash-2015.08.26` [05:27:10] ah I see [05:27:15] ( i suppose it should be sort -k3, but if the clusteris health just sort works :) [05:27:27] heh [05:27:38] should we drop them now or wait for it all to be green? [05:27:44] just drop them now [05:27:51] ok [05:28:50] 14, 15, 16? [05:29:31] best guess would be bad events? the default config auto-creates indices [05:31:24] !log deleted indexes for 08/14, 15, 25, 26 on logstash [05:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:31:51] (03PS4) 10EBernhardson: Refactor monolog handling for kafka logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) [05:32:38] * ebernhardson wonders if now is a good time to mention i'm working on a new destination for high volume logs :) [05:32:48] haha [05:32:51] kafka eh [05:33:10] deleting some more indexes, actually [05:33:11] yea, but the initial logs are too big and arn't going to logstash right now anyways [05:33:31] !log deleted logstash indexes for 08/27 and 28 too [05:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:33:51] ebernhardson: btw, I raised low-watermark to 90 from 85. high is 95. is that ok? [05:36:06] yuvipanda: hmm, the largest shards are 100gb so it will need that much free space per node for joining smaller shards, it might need enough space for an entire index to do the optimize after they stop writing to the index [05:36:32] i would guess, setting the bare limit to >100G and the regular limit to 300G/node is probably sane? [05:36:45] hmm not sure what the percentages are [05:36:50] let me open a bug actually [05:36:59] err, i mean, not sure how the percentages translate [05:37:05] yea i'm checking :) [05:38:03] machines have 2.7tb for ES, so 4% gives 108GB free, and 11% gives 297 free. 5 and 10 are probably reasonable [05:38:14] 5 and 12 might be better :) [05:39:39] 6operations, 10Wikimedia-Logstash: Logstash elasticsearch cluster filled up, dropping logstash events - https://phabricator.wikimedia.org/T113571#1669633 (10yuvipanda) 3NEW [05:39:49] ebernhardson: ^ [05:40:46] ebernhardson: I'm going to respond to the ops@ mail and then go sleep. I guess everything looks healthy now? [05:41:00] yuvipanda: yup looks fine [05:51:39] <_joe_> good morning [05:51:57] (03PS1) 10KartikMistry: Enable suggestions in ca, en, es, fr, it, ja, tr, ru, zh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240638 (https://phabricator.wikimedia.org/T112848) [05:52:10] <_joe_> yuvipanda: anything I should take care about right now? [05:52:17] _joe_: nope [05:52:31] thanks ebernhardson, TimStarling :) [05:53:48] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Give elasticsearch roots access to the logstash cluster as well. 
- https://phabricator.wikimedia.org/T113569#1669651 (10yuvipanda) 5Open>3Resolved a:3yuvipanda I have already done this, since IMO these are all running elasticsearch and should count... [05:54:09] RECOVERY - puppet last run on db2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:56:06] (03PS5) 10Yuvipanda: phab: Add passwd entries for vcs user [puppet] - 10https://gerrit.wikimedia.org/r/226573 (owner: 10Negative24) [05:56:15] (03CR) 10Yuvipanda: [C: 032] "ok :)" [puppet] - 10https://gerrit.wikimedia.org/r/226573 (owner: 10Negative24) [05:58:19] 6operations, 10Wikimedia-Logstash: Logstash elasticsearch cluster filled up, dropping logstash events - https://phabricator.wikimedia.org/T113571#1669660 (10EBernhardson) The low disk watermark is the point at which it will not allocate the next days index. I'm not exactly sure what the high watermark will do... [06:02:08] (03PS5) 10EBernhardson: Refactor monolog handling for kafka logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) [06:05:46] (03PS1) 10Glaisher: Remove Page and Index namespaces from $wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240640 (https://phabricator.wikimedia.org/T54709) [06:09:10] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 4, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 30, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 82, initializing_shards: 4, number_of_data_nodes: 3, delayed_unassigned_sh [06:09:10] RECOVERY - ElasticSearch health check for shards on logstash1006 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 4, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 30, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 82, initializing_shards: 4, number_of_data_nodes: 3, delayed_unassigned_sh [06:09:11] RECOVERY - ElasticSearch health check for shards on logstash1005 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 4, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 30, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 82, initializing_shards: 4, number_of_data_nodes: 3, delayed_unassigned_sh [06:09:11] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 4, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 30, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 82, initializing_shards: 4, number_of_data_nodes: 3, delayed_unassigned_sh [06:09:24] * ebernhardson sighs..almost worried me :P [06:10:00] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 4, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 30, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 82, initializing_shards: 4, 
number_of_data_nodes: 3, delayed_unassigned_sh [06:10:00] RECOVERY - ElasticSearch health check for shards on logstash1004 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 4, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 30, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 82, initializing_shards: 4, number_of_data_nodes: 3, delayed_unassigned_sh [06:10:46] ebernhardson: heh :) [06:11:04] ebernhardson: so in hindsight the rebuild was entirely unnecessary and I could've gotten away with just dropping the logs [06:11:06] err [06:11:08] the old indexes [06:23:28] 6operations, 10Deployment-Systems, 6Performance-Team, 6Release-Engineering-Team, 7HHVM: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#1669707 (10greg) The plan of action in this task is good, but can we get subtasks for the things tha... [06:23:46] yuvipanda: yea i think so :) [06:24:03] oh well. I guess that extended the outage by another 45mins at least, I guess [06:24:06] maybe mor [06:24:08] e [06:24:17] but I guess I learnt a lot about ES today :) [06:24:27] see, always an upside :) [06:24:32] the cat api's are pretty useful [06:24:35] yeah [06:25:37] ebernhardson: I also just 'dove in' instead of first going 'stop, let us see what this is set up as ' [06:25:51] ebernhardson: like, I was looking at logs only on logstash1001-3, instead of going 'so where is the master' [06:26:03] just no ElasticSearch fundamentals [06:26:56] I think I should spend some time understanding thefundamentals / basics of all the tech we rely on (Varnish, LVS, ES, MySQL, Apache, HHVM, Memcached, Redis, node) [06:27:01] uh, and cassandra I guess [06:27:18] master doesn't actually matter in es [06:27:24] i mean, it does, but not as an admin really [06:27:25] I didn't see the logs anywhere else [06:27:26] err [06:27:32] I mean, I saw the logs only on 1004 [06:27:35] about the watermark [06:27:36] ah [06:27:38] others had unrelated log errors [06:27:55] from unrelated times [06:28:02] and 1001-3 had nothing at all [06:28:30] oh, well yea. now you know about data nodes :) [06:28:44] indeed [06:28:58] ebernhardson: why have data nodes at all, btw? do the non-data nodes keep data in memory? [06:29:20] yuvipanda: typically for routing [06:29:32] so they don't locally cache? [06:29:36] excepting in this cluster, every data node has a copy of every shard [06:29:56] wonder why this was setup this way [06:30:16] yuvipanda: when es serves a query it basically repeats that query to 1 copy of each shard. 
Then the one that issued the query gets all those results merges them and rescores them [06:30:29] so, in other setups, the non-data nodes do the merging and the deciding which data nodes to talk to [06:30:36] aaah I see [06:30:41] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:50] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:00] PROBLEM - puppet last run on aqs1002 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:10] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 4 failures [06:31:40] PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:50] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:29] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:39] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:39] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:53] ^ is always scary when it happens yet completely harmless [06:33:19] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:20] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:30] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:06] 6operations, 10Beta-Cluster, 10RESTBase, 6Services: Firewall rules too restrictive on deployment-restbase0x.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T113528#1669735 (10MoritzMuehlenhoff) This is caused by the Hiera data used by the ferm rules: cassandra::seeds in prod uses hostname... [06:42:42] 6operations, 10Math, 5Patch-For-Review: Install texlive-extra-utils on mw appservers - https://phabricator.wikimedia.org/T109195#1669744 (10Physikerwelt) @Dzahn: The situation is slightly more complicated. MediaWiki uses its own set of commans that are most of the time identical to the tex commands from vari... 
[06:43:30] 6operations, 10Math, 5Patch-For-Review: Install texlive-extra-utils on mw appservers - https://phabricator.wikimedia.org/T109195#1669748 (10Physikerwelt) [06:47:07] (03PS1) 10Muehlenhoff: Use DNS names in Hiera data for restbase in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/240642 (https://phabricator.wikimedia.org/T113528) [06:54:44] I'd never listened to the original, but as an armchair producer https://www.youtube.com/watch?v=IekTzYgxx7w#t=50s is fucking gold [06:55:03] Hah, wrong channel :) [06:55:07] But I'm right [06:56:00] RECOVERY - puppet last run on db1056 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:56:10] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:56:50] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:56:59] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:00] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:57:01] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:09] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:19] RECOVERY - puppet last run on aqs1002 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:57:29] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:57:39] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:40] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:59:50] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:18:40] <_joe_> Keegan: you tricked me in looking at a Taylor Swift video [07:18:52] <_joe_> you should be ashamed of yourself. [07:19:05] _joe_: I DIDN'T MEAN IT [07:19:24] _joe_: I've been working on https://en.wikipedia.org/wiki/1989_(Ryan_Adams_album) [07:19:50] I've never heard the Taylor Swift album in its entirety [07:20:17] <_joe_> I mean, it's ok to listen to taylor swift if you're under 18 [07:20:25] I'm a huge Ryan Adams fan [07:20:33] Ha :) [07:20:41] <_joe_> oh, then I'm grateful I didn't finish :P [07:21:26] <_joe_> (I was about to tell you I didn't see you as a brian adams fan :P) [07:21:38] Ha! [07:22:32] _joe_: no dissing taylor swit with legoktm around [07:22:38] https://www.youtube.com/watch?v=xC87JtZECjM [07:22:46] Those were the best days of my life [07:22:47] * yuvipanda likes at least one bryan adams album [07:22:48] and katy perry! [07:22:55] yuvipanda: let him learn that on his own [07:23:06] we've been down that road before...but that's over now! [07:23:20] AaronSchulz: Those were the best days of my life [07:23:28] <_joe_> AaronSchulz: I would've never thought you as a KP fan [07:23:46] isn't lego? [07:24:13] the dead can speak for themselves. [07:24:15] * AaronSchulz isn't KP fan, though Hot n' Cold is good stuff [07:24:16] <_joe_> I mean I listened to both KP and TS just because they did a collaboration with Snoop and Kendrick, respectively (both pretty lame, too) [07:24:33] Kendrick Lamer? 
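The logstash index cleanup walked through earlier in this log (list the indices with the cat API, delete the oldest ones, then raise the disk-allocation watermarks) can be sketched as a short script. This is a minimal illustration using only the Python standard library; it assumes the Elasticsearch HTTP API is reachable on localhost:9200 exactly as in the curl commands quoted above, and the retention count is an invented placeholder rather than anything decided in the channel.

    #!/usr/bin/env python3
    """Minimal sketch of the logstash index cleanup discussed above.

    Assumes the Elasticsearch HTTP API on localhost:9200 (as in the curl
    examples) and daily indices named logstash-YYYY.MM.DD. KEEP_DAYS is an
    illustrative placeholder, not a value taken from the conversation.
    """
    import json
    import urllib.request

    ES = 'http://localhost:9200'
    KEEP_DAYS = 31  # hypothetical retention


    def es(method, path, body=None):
        # Small helper for GET/PUT/DELETE calls against the ES HTTP API.
        data = json.dumps(body).encode() if body is not None else None
        req = urllib.request.Request(ES + path, data=data, method=method)
        if data:
            req.add_header('Content-Type', 'application/json')
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode()


    # Equivalent of `localhost:9200/_cat/indices | sort`: the index name is the
    # third column (hence the `sort -k3` remark above), and the date suffix
    # sorts chronologically.
    rows = [line.split() for line in es('GET', '/_cat/indices').splitlines() if line.strip()]
    names = sorted(r[2] for r in rows if len(r) > 2 and r[2].startswith('logstash-'))

    # Same operation as `curl -XDELETE localhost:9200/logstash-2015.08.26`,
    # applied to everything older than the newest KEEP_DAYS indices.
    for name in names[:-KEEP_DAYS]:
        print('deleting', name)
        es('DELETE', '/' + name)

    # The watermark change mentioned above (low raised to 90, high at 95) maps
    # to a transient cluster-settings update along these lines.
    es('PUT', '/_cluster/settings', {'transient': {
        'cluster.routing.allocation.disk.watermark.low': '90%',
        'cluster.routing.allocation.disk.watermark.high': '95%',
    }})

The usual off-the-shelf answer for this kind of daily-index rotation is elasticsearch-curator; the sketch above simply mirrors the manual curl steps from the conversation.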
[07:24:44] _joe_: Ryan Adams' 1989 is awesome [07:24:48] <_joe_> yuvipanda: the best MC around [07:24:53] <_joe_> Keegan: will listen to it [07:25:07] memcached? [07:25:22] <_joe_> but now I'm supposed to be coding, so it's either uptempo hip hop or classical music :) [07:26:07] _joe_: https://www.youtube.com/watch?v=xZ92nnR1Pt8 :( [07:26:07] err [07:26:07] :) [07:26:07] http://www.newyorker.com/magazine/2014/12/01/sound-sweden [07:26:19] Everything we hear is Swedes. [07:28:36] <_joe_> Keegan: either Swedes or Pharrell [07:28:53] _joe_: The Swedes produce that. [07:29:11] * Keegan isn't knocking, it's a successful formula [07:29:15] <_joe_> no, Pharrell is actually producing a ton of the pop stuff, the good one [07:29:32] <_joe_> and he is more influenced by 70's rnb and funky [07:29:35] _joe_: Is he, or is he in the studio? [07:29:59] How music is made is a bad product :) [07:30:00] <_joe_> Keegan: btw, we're horribly OT here [07:30:23] Indeed :) [07:44:43] (03CR) 10Alexandros Kosiaris: [C: 04-1] "premise looks good to me, minor comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/240582 (owner: 10BBlack) [07:47:22] (03CR) 10Hashar: [C: 04-1] "Some thoughts inline related to the ldapsupportlib." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/240000 (owner: 10Legoktm) [08:04:39] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above 42 [08:09:53] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Neil P. Quinn - https://phabricator.wikimedia.org/T113533#1669844 (10Jdforrester-WMF) This signifies manager approval. [08:13:19] (03Abandoned) 10Filippo Giunchedi: icinga: unify swift alerts [puppet] - 10https://gerrit.wikimedia.org/r/209217 (https://phabricator.wikimedia.org/T88974) (owner: 10Filippo Giunchedi) [08:17:39] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 42 [08:32:30] (03PS1) 10Muehlenhoff: Fix definition of $DEPLOYABLE_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/240647 (https://phabricator.wikimedia.org/T113351) [08:33:07] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Neil P. Quinn - https://phabricator.wikimedia.org/T113533#1669875 (10ori) >>! In T113533#1669844, @Jdforrester-WMF wrote: > This signifies manager approval. Oh James, have you learned nothing from Poststructuralism? 
:) [08:37:20] RECOVERY - HHVM rendering on mw1056 is OK: HTTP OK: HTTP/1.1 200 OK - 65468 bytes in 0.119 second response time [08:37:59] RECOVERY - Apache HTTP on mw1056 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.043 second response time [08:39:10] RECOVERY - HHVM rendering on mw1104 is OK: HTTP OK: HTTP/1.1 200 OK - 65460 bytes in 0.125 second response time [08:39:11] RECOVERY - Apache HTTP on mw1104 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.048 second response time [08:39:19] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.050 second response time [08:39:21] !log restarted HHVM @ mw1056, 1104, 1122 [08:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:39:29] RECOVERY - HHVM rendering on mw1122 is OK: HTTP OK: HTTP/1.1 200 OK - 65460 bytes in 0.440 second response time [08:41:32] (03PS2) 10Jcrespo: Deleting all mention of old servers on production config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239441 [08:42:31] (03CR) 10Jcrespo: [C: 032] Deleting all mention of old servers on production config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239441 (owner: 10Jcrespo) [08:47:14] 6operations, 10Beta-Cluster, 6Labs, 10Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1669896 (10Chmarkine) [[ https://letsencrypt.org/ | Let's Encrypt ]] provides free trusted(*) DV non-wildcard certs. We have 31 domains lists [[... [08:50:44] I am merging, but not deploying just for a couple of comment updates [08:50:57] will deploy later with more changes [08:53:19] PROBLEM - Hadoop NodeManager on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:21] PROBLEM - puppet last run on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:29] PROBLEM - configured eth on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:30] PROBLEM - Hadoop DataNode on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:30] PROBLEM - Disk space on Hadoop worker on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:41] PROBLEM - YARN NodeManager Node-State on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:50] PROBLEM - Check size of conntrack table on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:59] PROBLEM - RAID on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:54:00] PROBLEM - dhclient process on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
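To make the shard fan-out described a little further up concrete (a search is sent to one copy of each shard, and the node that received the query merges and rescores the per-shard results), here is a toy model of that merge step. It is an illustration of the idea only, not Elasticsearch code, and the shard results are invented:

    # Toy model of the scatter/gather described above: each shard returns its
    # own top hits sorted by score, and the node that received the query merges
    # them into a single global top-N. Purely illustrative data and code.
    from itertools import chain

    shard_results = [
        [('doc-a', 7.1), ('doc-b', 3.4)],  # shard 0 hits as (doc id, score)
        [('doc-c', 6.0), ('doc-d', 2.2)],  # shard 1
        [('doc-e', 9.8)],                  # shard 2
    ]

    def coordinate(per_shard, size=3):
        # The merge-and-rescore step: flatten the per-shard lists and keep the
        # globally best `size` hits.
        hits = chain.from_iterable(per_shard)
        return sorted(hits, key=lambda hit: hit[1], reverse=True)[:size]

    print(coordinate(shard_results))
    # [('doc-e', 9.8), ('doc-a', 7.1), ('doc-c', 6.0)]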
[08:54:59] RECOVERY - Hadoop NodeManager on analytics1047 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [08:55:01] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 29 minutes ago with 0 failures [08:55:09] RECOVERY - configured eth on analytics1047 is OK: OK - interfaces up [08:55:10] RECOVERY - Hadoop DataNode on analytics1047 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [08:55:11] RECOVERY - Disk space on Hadoop worker on analytics1047 is OK: DISK OK [08:55:29] RECOVERY - YARN NodeManager Node-State on analytics1047 is OK: OK: YARN NodeManager analytics1047.eqiad.wmnet:8041 Node-State: RUNNING [08:55:39] RECOVERY - Check size of conntrack table on analytics1047 is OK: OK: nf_conntrack is 0 % full [08:55:39] RECOVERY - RAID on analytics1047 is OK: OK: optimal, 13 logical, 14 physical [08:55:39] RECOVERY - dhclient process on analytics1047 is OK: PROCS OK: 0 processes with command name dhclient [08:55:56] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1669904 (10fgiunchedi) [08:55:59] 6operations, 10RESTBase, 10RESTBase-Cassandra: secure Cassandra/RESTBase cluster - https://phabricator.wikimedia.org/T94329#1669905 (10fgiunchedi) [08:56:02] 6operations, 10RESTBase, 10RESTBase-Cassandra: use non-default credentials when authenticating to Cassandra - https://phabricator.wikimedia.org/T92590#1669902 (10fgiunchedi) 5Open>3Resolved yep, this has been deployed everywhere now [08:59:08] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Cassandra inter-node encryption (TLS) - https://phabricator.wikimedia.org/T108953#1669918 (10fgiunchedi) we're live with inter-dc encryption for cassandra in production and test, we are still missing a way to track expiration of certs/ca, ak... [09:13:00] (03PS2) 10Filippo Giunchedi: WIP: configure RESTBase for codfw datacenter [puppet] - 10https://gerrit.wikimedia.org/r/240578 (https://phabricator.wikimedia.org/T108613) (owner: 10Eevans) [09:20:22] (03Abandoned) 10Filippo Giunchedi: swiftrepl: sync object timestamp [software] - 10https://gerrit.wikimedia.org/r/167828 (owner: 10Filippo Giunchedi) [09:21:50] (03PS1) 10Jcrespo: Removing references to es1001-es1010 due to decommision [puppet] - 10https://gerrit.wikimedia.org/r/240656 [09:27:52] (03CR) 10Alexandros Kosiaris: [C: 04-1] Fix definition of $DEPLOYABLE_NETWORKS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/240647 (https://phabricator.wikimedia.org/T113351) (owner: 10Muehlenhoff) [09:35:34] (03PS8) 10Alexandros Kosiaris: squid: logrotate for webproxy on carbon [puppet] - 10https://gerrit.wikimedia.org/r/239009 (https://phabricator.wikimedia.org/T97119) (owner: 10Dzahn) [09:35:40] (03CR) 10Alexandros Kosiaris: [V: 032] squid: logrotate for webproxy on carbon [puppet] - 10https://gerrit.wikimedia.org/r/239009 (https://phabricator.wikimedia.org/T97119) (owner: 10Dzahn) [09:37:36] (03CR) 10Jcrespo: "There are still references of es masters on the coredb role, but those are deprecated and unused. 
It is not worth maintain that file anymo" [puppet] - 10https://gerrit.wikimedia.org/r/240656 (owner: 10Jcrespo) [09:37:57] 6operations, 10RESTBase: uneven load on restbase workers - https://phabricator.wikimedia.org/T113579#1670007 (10fgiunchedi) 3NEW [09:40:44] !log performing latest (software) steps to decom es1001-es1010 (puppet disabling, etc.) [09:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:41:18] (03CR) 10Zfilipin: WIP Move Ruby related packages to a separate file (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/237876 (https://phabricator.wikimedia.org/T110865) (owner: 10Zfilipin) [09:41:56] (03PS5) 10Zfilipin: WIP Move Ruby related packages to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/237876 (https://phabricator.wikimedia.org/T110865) [09:43:38] (03PS6) 10Zfilipin: WIP Move Ruby related packages to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/237876 (https://phabricator.wikimedia.org/T110865) [09:43:55] (03PS7) 10Zfilipin: Move Ruby related packages to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/237876 (https://phabricator.wikimedia.org/T110865) [09:44:45] (03PS2) 10Jcrespo: Removing references to es1001-es1010 due to decommision [puppet] - 10https://gerrit.wikimedia.org/r/240656 [09:53:02] (03PS1) 10Hashar: contint: migrate ops dependencies to a new class [puppet] - 10https://gerrit.wikimedia.org/r/240659 [09:58:06] (03CR) 10Jcrespo: [C: 032] Removing references to es1001-es1010 due to decommision [puppet] - 10https://gerrit.wikimedia.org/r/240656 (owner: 10Jcrespo) [09:58:18] (03CR) 10Hashar: [C: 031 V: 032] "Cherry picked on integration puppetmaster and it is fine." [puppet] - 10https://gerrit.wikimedia.org/r/240659 (owner: 10Hashar) [09:58:43] <_joe_> valhallasw`cloud: around? if so, what populates the redis that dynamicproxy reads? [09:59:48] _joe_: not sure about special:novaproxy, but on toollabs it's ... err... something the `webservice` calls. Lemme see. [10:00:12] <_joe_> valhallasw`cloud: I mean on toollabs :) [10:00:41] it's portgrabber, I think [10:01:17] _joe_: https://github.com/wikimedia/operations-puppet/blob/acacf97e2df962fef83487a461f3559fa07e4d6f/modules/toollabs/files/portgrabber.py [10:01:30] <_joe_> valhallasw`cloud: yeah makes sense [10:01:33] <_joe_> thanks [10:02:46] ah, no [10:02:47] it's https://github.com/wikimedia/operations-puppet/blob/acacf97e2df962fef83487a461f3559fa07e4d6f/modules/toollabs/files/proxylistener.py [10:02:58] that's also where the authentication is [10:03:06] (identd) [10:14:12] 6operations, 7Graphite: Upgrade to Grafana v2.x - https://phabricator.wikimedia.org/T104738#1670142 (10Krinkle) [10:16:37] (03PS2) 10Filippo Giunchedi: WIP: report swift containers aggregated stats [puppet] - 10https://gerrit.wikimedia.org/r/240358 [10:17:24] (03CR) 10jenkins-bot: [V: 04-1] WIP: report swift containers aggregated stats [puppet] - 10https://gerrit.wikimedia.org/r/240358 (owner: 10Filippo Giunchedi) [10:21:03] (03PS1) 10Jcrespo: Removing dns entries for es1001-es1010 for decom [dns] - 10https://gerrit.wikimedia.org/r/240668 [10:23:05] (03CR) 10Jcrespo: [C: 04-1] "Do not remove mgmt entries yet." [dns] - 10https://gerrit.wikimedia.org/r/240668 (owner: 10Jcrespo) [10:29:07] 6operations, 10ops-eqiad: Decommission es1001-es1010 - https://phabricator.wikimedia.org/T113080#1670175 (10jcrespo) @Cmjohnson adding you to the ticket so that we can coordinate. The current state of es1001-es1010 is: * service: depooled. 
MySQL has been stopped for over a week with no issues. * Servers are... [10:42:38] bye bye icinga errors [10:53:46] (03CR) 10BBlack: improve XFF/XFP/XRIP code in common VCL (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/240582 (owner: 10BBlack) [10:58:24] (03PS4) 10BBlack: improve XFF/XFP/XRIP code in common VCL [puppet] - 10https://gerrit.wikimedia.org/r/240582 [11:01:14] (03PS1) 10Muehlenhoff: Revert "Fix definition of deployable networks" [puppet] - 10https://gerrit.wikimedia.org/r/240673 [11:04:27] (03PS1) 10Alexandros Kosiaris: Revert "Backup home_pmtpa on bast1001" [puppet] - 10https://gerrit.wikimedia.org/r/240674 (https://phabricator.wikimedia.org/T113265) [11:07:28] !log upgrading varnishes to 3.0.6plus-wm8 (non-restarting, just pkg update on-disk) [11:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:08:21] 6operations, 5Patch-For-Review: Backup and decom home_pmtpa - https://phabricator.wikimedia.org/T113265#1670303 (10akosiaris) >>! In T113265#1669251, @Dzahn wrote: > backup done: > > > ``` > The defined FileSet resources are: > 1: home > 2: home_pmtpa > Select FileSet resource (1-2): 2 > +--------+... [11:09:11] 6operations, 5Patch-For-Review: Backup and decom home_pmtpa - https://phabricator.wikimedia.org/T113265#1670304 (10akosiaris) >>! In T113265#1669321, @Dzahn wrote: > was this all or should anything be done on nas1001-a.eqiad? If so, please do. Yes. unexport the volume and set it offline. I 'll irrevocably des... [11:09:39] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "Backup home_pmtpa on bast1001" [puppet] - 10https://gerrit.wikimedia.org/r/240674 (https://phabricator.wikimedia.org/T113265) (owner: 10Alexandros Kosiaris) [11:14:07] !log stopping pybal on lvs300[12]; lvs300[34] taking over [11:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:15:30] 6operations, 5Patch-For-Review: Backup and decom home_pmtpa - https://phabricator.wikimedia.org/T113265#1670311 (10akosiaris) 5Open>3Resolved a:3akosiaris Backup configuration reverted, resolving [11:15:38] PROBLEM - Disk space on cp1065 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=79%) [11:15:45] nice :P [11:17:28] bblack: cp1065 is exactly the one you mentioned in that comment in https://gerrit.wikimedia.org/r/240582. coincidence ? [11:17:33] someone salt'd a tail -f on a logfile? :P [11:17:37] what ? [11:17:40] pfff [11:18:09] PROBLEM - pybal on lvs3002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [11:18:12] akosiaris: unrelated, but what pushed cp1065 over the edge is the package upgrades running right now, I think [11:18:25] and... I forgot to go downtime in icinga for lvs300[12] pybal :P [11:19:10] PROBLEM - pybal on lvs3001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [11:19:12] also killing that tail -f meant 400 GB freed [11:19:20] sigh [11:19:21] MB [11:19:57] and sudo apt-get clean another 1.5GB [11:20:18] hmm how about puppet-run does an apt-get clean as well ? [11:20:23] paravoid: bblack ^ ? [11:20:42] although I 've occasionally have come to love that cache [11:21:03] mostly to get back a version of a package no longer anywhere [11:21:10] RECOVERY - Disk space on cp1065 is OK: DISK OK [11:21:20] yeah ... [11:21:39] the main problem on cp1065 is /var/log/varnish/varnish.log, but I'm not even sure why yet ... 
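Returning to the dynamicproxy question further up: proxylistener.py (driven by portgrabber) is what writes a tool's web backend into the Redis database the proxy reads. A registration boils down to a hash write along the lines of the sketch below; the key layout ('prefix:<tool>' mapping a route pattern to a backend URL) and the Redis location are assumptions made for illustration, so treat the linked proxylistener.py as the authoritative schema.

    # Rough sketch of the Redis write that ends up registering a tool's
    # webservice with dynamicproxy (normally done by proxylistener.py when the
    # tool calls portgrabber). Key layout and host are assumed, not verified.
    import redis

    r = redis.StrictRedis(host='localhost', port=6379)

    tool = 'mytool'   # hypothetical tool name
    port = 12345      # hypothetical port handed out by portgrabber
    r.hset('prefix:' + tool, '.*', 'http://localhost:%d' % port)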
[11:21:57] !log killed tail -f varnishncsa.log on cp1065 and ran apt-get clean to reclaim some disk space [11:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:23:44] I'm still investigating that, it exists on some and not others. I hope it's not appearing as a result of the package upgrade [11:24:24] I think it is :P [11:25:53] fixing it [11:31:08] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures [11:34:52] (03CR) 10Muehlenhoff: Update mod_status configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/239998 (owner: 10Ori.livneh) [11:36:02] !log reinstall lvs300[12] to jessie - T96375 [11:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:37:20] (03Abandoned) 10Muehlenhoff: Fix definition of $DEPLOYABLE_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/240647 (https://phabricator.wikimedia.org/T113351) (owner: 10Muehlenhoff) [11:49:38] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [11:56:40] !log restarting varnish daemons on half of maps, parsoid, misc clusters (package upgrade, shm_reclen change) [11:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:58:04] 10Ops-Access-Requests, 6operations: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1670368 (10mark) a:3coren [12:04:51] 6operations, 10RESTBase, 6Services, 3Mobile-Content-Service, 7Varnish: Enable caching for the Mobile Content Service's RESTBase public endpoints - https://phabricator.wikimedia.org/T113591#1670371 (10mobrovac) 3NEW [12:05:49] !log installed rpcbind security updates on eeden, baham, radon, maerlant, rhenium [12:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:14:51] Things are slow or is it just me? [12:19:03] there was a strange curve few minutes ago: https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&me=Wikimedia&m=cpu_report&s=by+name&mc=2&g=network_report [12:19:29] same on elastic cluster: https://ganglia.wikimedia.org/latest/?c=Elasticsearch%20cluster%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [12:22:31] looking [12:22:33] bblack: ^ [12:22:52] Sep 24 12:14:57 cr1-esams rpd[1475]: bgp_recv: peer 10.20.0.11 (External AS 64600): received unexpected EOF [12:22:55] Sep 24 12:14:57 cr1-esams rpd[1475]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.20.0.11 (External AS 64600) changed state from Established to Idle (event Closed) (instance master) [12:22:59] did someone restart pybals? [12:24:06] faidon@lvs3001:~$ uptime 12:24:01 up 12 min, 2 users, load average: 0.11, 0.27, 0.22 [12:24:25] that's bblack I assume :) [12:24:57] SAL confirms [12:26:07] 6operations, 10RESTBase, 6Services, 3Mobile-Content-Service, 7Varnish: Enable caching for the Mobile Content Service's RESTBase public endpoints - https://phabricator.wikimedia.org/T113591#1670471 (10BBlack) Currently, the VCL in the text and mobile clusters (identical in this regard) aren't messing with... 
[12:26:22] paravoid: yes [12:26:39] I think a hiccup happened during the lvs3001 reboot, but it was fixed pretty quickly [12:26:40] 6operations: Reduce rpcbind use - https://phabricator.wikimedia.org/T106477#1670472 (10MoritzMuehlenhoff) a:3MoritzMuehlenhoff [12:26:52] well, https://gdash.wikimedia.org/dashboards/reqsum/ :) [12:27:02] seems to have been fixed quickly, yes [12:27:20] it looks like it was a case of bad service dependencies defined at the OS level, I *think* [12:27:46] how do you mean? [12:28:22] it happened when lvs3001 was definitely fully-puppetized and working, and I did a normal reboot (just to fix up eth params stuff). pybal came up and talked BGP, but some of the service IPs were showing no backends in "ipvsadm". I suspect it may have raced with or started before wikimedia-lvs-realserver defining IPs, or something along those lines. [12:28:31] restarting the pybal daemon fixed the ipvsadm output and such [12:28:36] hmmm [12:28:38] interesting [12:28:39] and scary [12:28:42] yeah [12:29:06] it was spewing stuff like this before the pybal restart in syslog: [12:29:07] [ 208.248883] IPVS: wrr: TCP 91.198.174.192:80 - no destination available [12:29:10] [ 208.248889] IPVS: sh: TCP 91.198.174.192:443 - no destination available [12:29:36] right [12:29:40] (03PS2) 10Mobrovac: Use DNS names in Hiera data for restbase in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/240642 (https://phabricator.wikimedia.org/T113528) (owner: 10Muehlenhoff) [12:30:01] I have a copy of the ipvsadm output anyways, will make a ticket [12:31:00] 6operations, 10Beta-Cluster, 6Labs, 10Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1670489 (10Krenair) I think we'd also want upload.beta.wmflabs.org, maybe stream.wmflabs.org, all of the m./zero. variants? What about mx.beta.w... [12:32:08] 10Ops-Access-Requests, 6operations: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1670492 (10coren) >>! In T111123#1666433, @mark wrote: > Let's also investigate what we need to do to allow him to do OS installs, this should become possible for people to do with... [12:35:30] 10Ops-Access-Requests, 6operations: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1670497 (10mark) >>! In T111123#1670492, @coren wrote: >>>! In T111123#1666433, @mark wrote: >> Let's also investigate what we need to do to allow him to do OS installs, this shoul... [12:35:35] 6operations, 10ops-esams: Rack and configure asw-esams (new 2xQFX5100 stack) - https://phabricator.wikimedia.org/T91643#1670498 (10Cmjohnson) 5Open>3Resolved Updated racktables with S/N, purchase dates, RT# and warranty info [12:36:36] PROBLEM - service on lvs3001 is CRITICAL: CRITICAL - Expecting active but unit is inactive [12:37:06] PROBLEM - service on lvs3002 is CRITICAL: CRITICAL - Expecting active but unit is inactive [12:38:24] ^ what does that even mean? [12:38:30] lol [12:38:56] 6operations, 10Traffic, 7Pybal: pybal-related issue on host start can break service IPs... - https://phabricator.wikimedia.org/T113597#1670504 (10BBlack) 3NEW [12:39:00] 6operations, 10ops-eqiad: RMA Samsung EVO ssds - https://phabricator.wikimedia.org/T107326#1670513 (10Cmjohnson) I spoke with our rep with Dell and a straight return is probably not going to be possible. They have a 21 day return policy and we were never going to make that. I am working on a swap for Samsu... 
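The failure mode described above (pybal up and speaking BGP, but some service IPs left with no backends in ipvsadm, hence the "no destination available" messages) lends itself to a mechanical check. A minimal sketch, assuming it runs as root on the LVS host and that `ipvsadm -Ln` prints the usual table of TCP/UDP virtual-service header lines followed by indented '->' real-server lines:

    #!/usr/bin/env python3
    # Minimal sketch of a check for the failure mode described above: pybal is
    # running, but some LVS services have zero real servers behind them.
    # Assumes root on the LVS host and the standard `ipvsadm -Ln` table layout.
    import subprocess
    import sys

    out = subprocess.check_output(['ipvsadm', '-Ln'], universal_newlines=True)

    empty, current = [], None
    for line in out.splitlines():
        if line.startswith(('TCP', 'UDP')):
            if current is not None:
                empty.append(current)      # previous service had no '->' lines
            current = line.split()[1]      # the service's vip:port
        elif line.lstrip().startswith('->'):
            current = None                 # this service has at least one backend
    if current is not None:
        empty.append(current)

    if empty:
        print('CRITICAL - LVS services with no destinations: ' + ', '.join(empty))
        sys.exit(2)
    print('OK - all LVS services have at least one destination')
    sys.exit(0)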
[12:39:11] 6operations, 10Traffic, 7Pybal: pybal-related issue on host start can break service IPs... - https://phabricator.wikimedia.org/T113597#1670516 (10BBlack) [12:40:43] ah found it [12:40:47] RECOVERY - service on lvs3002 is OK: OK - confd is active [12:40:56] "confd" was not running - apparently systemd doesn't start it up, only puppet does [12:41:04] 6operations, 10Beta-Cluster, 10RESTBase, 6Services, 5Patch-For-Review: Firewall rules too restrictive on deployment-restbase0x.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T113528#1670523 (10mobrovac) I cherry-picked the patch on `deployment-puppetmaster.deployment-prep.eqiad.wmflabs... [12:41:09] (which I had disabled at that moment) [12:41:56] no unit name in the alert? heh [12:42:16] RECOVERY - service on lvs3001 is OK: OK - confd is active [12:42:37] only in the recovery I guess [12:44:11] I'll fix it real quick [12:45:47] (03PS1) 10coren: Fix ssh public key for junikowski [puppet] - 10https://gerrit.wikimedia.org/r/240684 (https://phabricator.wikimedia.org/T113298) [12:48:04] 6operations, 10Traffic, 5Patch-For-Review: Upgrade codfw,ulsfo,esams LVS to jessie - https://phabricator.wikimedia.org/T96375#1670546 (10BBlack) [12:48:19] 6operations, 10Traffic, 5Patch-For-Review: Upgrade codfw,ulsfo,esams LVS to jessie - https://phabricator.wikimedia.org/T96375#1670547 (10BBlack) 5Open>3Resolved a:3BBlack [12:49:24] (03PS1) 10Filippo Giunchedi: nrpe: report unit name on messages [puppet] - 10https://gerrit.wikimedia.org/r/240685 [12:49:28] (03PS2) 10coren: Add asherman to researchers [puppet] - 10https://gerrit.wikimedia.org/r/240369 (https://phabricator.wikimedia.org/T113118) [12:49:30] bblack: ^ [12:49:59] 6operations, 10Beta-Cluster, 10RESTBase, 6Services, 5Patch-For-Review: Firewall rules too restrictive on deployment-restbase0x.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T113528#1670551 (10mobrovac) After cherry-picking [PS 240673](https://gerrit.wikimedia.org/r/#/c/240673/) as wel... 
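The "service on lvs300[12]" alerts just above show why the nrpe "report unit name on messages" change was wanted: the PROBLEM line never said that the inactive unit was confd. A stand-alone check in the same spirit might look like the sketch below; it is not the plugin from that Gerrit change, just the same idea, and it assumes systemd's `systemctl is-active` (Python 3.5+ for subprocess.run).

    #!/usr/bin/env python3
    # Sketch of a systemd unit check that, unlike the alert above, names the
    # unit it is complaining about. Usage: check_unit.py confd
    import subprocess
    import sys

    unit = sys.argv[1] if len(sys.argv) > 1 else 'confd'
    state = subprocess.run(['systemctl', 'is-active', unit],
                           stdout=subprocess.PIPE,
                           universal_newlines=True).stdout.strip()

    if state == 'active':
        print('OK - %s is active' % unit)
        sys.exit(0)
    print('CRITICAL - Expecting active but unit %s is %s' % (unit, state or 'unknown'))
    sys.exit(2)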
[12:50:39] !log restarting varnishd instances on text, mobile, upload clusters for package upgrade (slow salt, no parallelism, ~5m spacing - FE cache loss, BE cache stays, should take ~9h) [12:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:51:00] 10Ops-Access-Requests, 6operations: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1670552 (10Krenair) (b) was already done in {T109640} [12:51:29] (03PS3) 10coren: Add asherman to researchers [puppet] - 10https://gerrit.wikimedia.org/r/240369 (https://phabricator.wikimedia.org/T113118) [12:52:21] (03PS1) 10Alexandros Kosiaris: caching proxy: Remove the access_log none directive [puppet] - 10https://gerrit.wikimedia.org/r/240686 [12:52:37] (03CR) 10Mobrovac: [C: 031] "Tested, works in conjunction with https://gerrit.wikimedia.org/r/#/c/240673/" [puppet] - 10https://gerrit.wikimedia.org/r/240642 (https://phabricator.wikimedia.org/T113528) (owner: 10Muehlenhoff) [12:52:50] (03CR) 10coren: [C: 032] "+asherman to researchers" [puppet] - 10https://gerrit.wikimedia.org/r/240369 (https://phabricator.wikimedia.org/T113118) (owner: 10coren) [12:52:55] (03PS2) 10Muehlenhoff: Revert "Fix definition of deployable networks" [puppet] - 10https://gerrit.wikimedia.org/r/240673 [12:54:03] 10Ops-Access-Requests, 6operations, 10Wikimedia-Blog, 5Patch-For-Review: stat1003/EventLogging access for asherman - https://phabricator.wikimedia.org/T113118#1670560 (10coren) 5Open>3Resolved With L3 signed and the new key, this is now pushed to production. [12:54:13] !log restarting varnish daemons on second half of maps, parsoid, misc clusters (package upgrade, shm_reclen change) [12:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:55:07] PROBLEM - Disk space on mw1152 is CRITICAL: DISK CRITICAL - free space: /tmp 98 MB (0% inode=99%) [12:55:31] (03CR) 10Muehlenhoff: [C: 032 V: 032] Revert "Fix definition of deployable networks" [puppet] - 10https://gerrit.wikimedia.org/r/240673 (owner: 10Muehlenhoff) [12:55:52] (03PS2) 10coren: Add chedasaurus to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/240371 (https://phabricator.wikimedia.org/T113302) [12:56:53] (03CR) 10coren: [C: 032] Add chedasaurus to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/240371 (https://phabricator.wikimedia.org/T113302) (owner: 10coren) [12:57:42] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1002 for Chedasaurus - https://phabricator.wikimedia.org/T113302#1670572 (10coren) 5Open>3Resolved This has now been pushed to production. [12:58:56] RECOVERY - Disk space on mw1152 is OK: DISK OK [12:59:49] mutante: Do you know what group need to be given for the access requested in https://phabricator.wikimedia.org/T113298 ? [13:02:06] I really wish people would give group names instead of host names [13:02:14] unless asking for a new group to be set up [13:02:14] (03CR) 10BBlack: [C: 031] nrpe: report unit name on messages [puppet] - 10https://gerrit.wikimedia.org/r/240685 (owner: 10Filippo Giunchedi) [13:02:44] Hi, investigating a Flow board issue in production (on es.wikiquote.org). What is the right way for me to look at the data? [13:02:47] (03PS3) 10Muehlenhoff: Use DNS names in Hiera data for restbase in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/240642 (https://phabricator.wikimedia.org/T113528) [13:03:03] to look at the data? what? 
[13:03:20] Krenair: the flow tables in the db [13:03:24] (03CR) 10Muehlenhoff: [C: 032 V: 032] Use DNS names in Hiera data for restbase in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/240642 (https://phabricator.wikimedia.org/T113528) (owner: 10Muehlenhoff) [13:03:38] oh, you just want someone to show you how to connect to the x1 db stephanebisson? [13:04:27] Krenair: if you say so. Will you show me? [13:05:06] do you have deployment/restricted access? [13:05:56] I don't think you do... [13:06:11] Coren: stat1002 access is analytics-privatedata-users afaik (as I was recently added) [13:06:15] Krenair: probably not. I can connect to deployment-fluorine and deployment-bastion but that's abou tit [13:06:34] stephanebisson, those are deployment-prep hosts in labs, not production [13:06:38] entirely separate thing [13:06:45] that's what I thought [13:07:12] is there a way for me to 'look' at the data without having prod access? [13:07:19] 6operations, 10Beta-Cluster, 10RESTBase, 6Services, 5Patch-For-Review: Firewall rules too restrictive on deployment-restbase0x.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T113528#1670591 (10mobrovac) 5Open>3Resolved a:3mobrovac Merged, resolving. [13:07:21] no [13:07:24] https://wikitech.wikimedia.org/wiki/Requesting_shell_access#New_User_Access [13:08:51] Krenair: I tried that but it seems I don't _absolutely_ need prod access so it was rejected [13:09:04] Krenair: is flow stuff no replicated to labs? :O [13:09:25] and is that a case of its not going to happen or just it hasn't happened yet? [13:09:25] addshore, I would be surprised if it was [13:09:35] stephanebisson, oh, this one: https://phabricator.wikimedia.org/T107886 [13:09:47] https://phabricator.wikimedia.org/T69397 [13:09:59] 6operations, 10Beta-Cluster, 6Labs, 10Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1670596 (10Lixxx235) Chmarkine, there's always StartCom/StartSSL which has free certs, and they're already trusted by default in all major brows... [13:10:17] Krenair: yes, that was last time [13:10:57] stephanebisson, yeah, you're very unlikely to get the ability to +2 to operations/mediawiki-config without the ability to then deploy the change [13:11:33] I can think of about one person who can do that without being a gerrit admin [13:11:43] that makes sense, and I would really prefer not to have prod access [13:11:56] addshore: Ah, ty. It's the "need access to stat1002" which wasn't explicit. [13:12:27] (03PS2) 10Alexandros Kosiaris: caching proxy: Remove the access_log none directive [puppet] - 10https://gerrit.wikimedia.org/r/240686 [13:14:57] Coren: I have a question I suspect you'll be able to answer! :) I was just writing a script on the analytics cluster (stat1002 actually) which planned on doing a HTTP get/post to lists.wikimedia.org, but it is unsuccessful as the domain resolves to 10.64.5.3. [13:15:37] Is this the case for all external domains inside production? I see if also happens when pinging google etc. [13:15:58] Arbitary access to the internet will be restricted [13:16:38] awesome! im guessing for lists.wikimedia.org I can just use the internal name anyway [13:17:23] addshore: I don't think fermium /has/ an internal IP. 
[13:17:35] hmm, okay [13:18:21] addshore: Which kinda surprises me if you say that it resolves to a 10/8 IP from stats1002 actually [13:18:40] yeh it does, but every other domain also resolves to the same ip [13:18:58] what's rdns for the ip tell you? [13:19:11] marc@stat1002:~$ host lists.wikimedia.org [13:19:13] lists.wikimedia.org has address 208.80.154.75 [13:19:13] ... [13:19:52] what's up? [13:20:06] meh, I was pinging [13:20:29] and now realise im tried and looking at totally the wrong ip.. [13:20:36] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0] [13:20:41] * addshore slaps self with a large trout.... [13:20:43] On stat1003 pinging lists.wikimedia.org gives you "From ae3-1022.cr2.eqiad.wikimedia.org (10.64.36.3) icmp_seq=3 Packet filtered" [13:21:09] I don't know where the 10.64.5.3 came from though [13:21:13] yeh... the IP I gave you was the IP fo stat1002 ;) [13:21:19] ah, hah [13:21:28] yeh... [13:21:33] might need a coffee..... [13:21:48] addshore: Oh, hah. At any rate, afaict, http[s] access to lists from stat1002 works. [13:21:57] awesome :) [13:22:20] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] caching proxy: Remove the access_log none directive [puppet] - 10https://gerrit.wikimedia.org/r/240686 (owner: 10Alexandros Kosiaris) [13:22:30] *goes to look at which other bits of his script are thus not working, im guessing everything requesting stuff from the outside world [13:22:52] Krenair: that's stat1003's gateway address [13:23:08] (sort of, it's a bit more complicated than that) [13:23:31] as the reverse DNS says, that's cr2-eqiad, i.e. the router [13:23:44] and the packet filtered means that there is an ACL on the router forbidding this type of traffic [13:24:11] addshore, I think some stuff connecting to the outside world might not work? if not, try http_proxy=webproxy:8080 [13:25:54] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1670603 (10BBlack) 5Open>3Resolved The above commit should be live on all prod + beta varnishes... [13:26:13] 6operations, 10Analytics-EventLogging, 10MediaWiki-extensions-NavigationTiming, 6Performance-Team: Increase maxUrlSize from 1000 to 1500 - https://phabricator.wikimedia.org/T112002#1670606 (10BBlack) The varnish change should be live on all production and beta varnishes now, raising the limit to ~2K there. [13:26:29] (03PS1) 10coren: Add junikowski to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/240691 (https://phabricator.wikimedia.org/T113298) [13:31:08] (03PS2) 10Filippo Giunchedi: nrpe: report unit name on messages [puppet] - 10https://gerrit.wikimedia.org/r/240685 [13:31:29] (03PS3) 10Filippo Giunchedi: WIP: report swift containers aggregated stats [puppet] - 10https://gerrit.wikimedia.org/r/240358 [13:31:42] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] nrpe: report unit name on messages [puppet] - 10https://gerrit.wikimedia.org/r/240685 (owner: 10Filippo Giunchedi) [13:32:52] (03PS1) 10BBlack: Add dhcp macaddr for lvs1007-11 T104458 [puppet] - 10https://gerrit.wikimedia.org/r/240695 [13:34:11] (03CR) 10BBlack: [C: 032] Add dhcp macaddr for lvs1007-11 T104458 [puppet] - 10https://gerrit.wikimedia.org/r/240695 (owner: 10BBlack) [13:34:21] Krenair: ahh coool https://wikitech.wikimedia.org/wiki/Http_proxy [13:34:25] hadn't found that yet! 
[13:34:33] there's also url-downloader.wikimedia.org:8080 [13:34:46] only difference I've found so far is that url-downloader has a file size limit [13:35:47] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:40:16] 10Ops-Access-Requests, 6operations: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1670622 (10coren) @papaul: All I'm missing now from you is the public key to a new SSH keypair as Mark outlined above for your access to prod. [13:42:12] (03PS2) 10coren: Add junikowski to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/240691 (https://phabricator.wikimedia.org/T113298) [13:43:07] (03CR) 10coren: [C: 032] Add junikowski to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/240691 (https://phabricator.wikimedia.org/T113298) (owner: 10coren) [13:43:11] 6operations, 10ops-eqiad, 10Traffic, 10netops, 5Patch-For-Review: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1670626 (10BBlack) Kicked off install of lvs1007-11 this morning. 10, 11 (row C, like lvs1012 already done) are booting PXE->installer fine. lvs00[789] seem to... [13:43:59] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1002 for JUnikowski_WMF - https://phabricator.wikimedia.org/T113298#1670629 (10coren) 5Open>3Resolved Done, with the new key. [13:44:47] 10Ops-Access-Requests, 6operations: Add Matanya to "restricted" to perform server side uploads - https://phabricator.wikimedia.org/T106447#1670632 (10Krenair) @mark: This has been waiting for two months now. [13:45:12] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Expand shell access for aklapper on Phabricator - https://phabricator.wikimedia.org/T113124#1670635 (10coren) p:5Triage>3Normal @yuvipanda: Is this stalled on input from @chasemp in re the sudo rule? [13:49:09] (03PS1) 10Faidon Liambotis: Correct typo for ae3-1022.cr2-eqiad PTR [dns] - 10https://gerrit.wikimedia.org/r/240697 [13:49:10] (03PS1) 10Faidon Liambotis: Repool codfw [dns] - 10https://gerrit.wikimedia.org/r/240698 [13:49:22] bblack: ^ [13:49:47] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 (Hue / Hive) for bmansurov - https://phabricator.wikimedia.org/T113069#1670639 (10coren) 5Open>3Resolved As far as I can tell, this request is complete with @ottomata having done the sync. [13:53:05] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Dan Foy - https://phabricator.wikimedia.org/T113324#1670644 (10coren) @DFoy: Before we can proceed, you need: (a) to provide us with a public ssh key that is not used anywhere else; and (b) to review and sign L3 if you have not yet done so;... [13:53:47] Krenair: heh, PHP Fatal error: Call to undefined function curl_init() on stat1002 :P [13:54:57] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Neil P. Quinn - https://phabricator.wikimedia.org/T113533#1670649 (10coren) p:5Triage>3Normal Happy fun three day clock started. [13:55:22] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1670654 (10Ottomata) Awesome, thanks @BBlack! 
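For reference, the outbound-HTTP situation above (the stat hosts have no direct route to the internet, so requests go through webproxy:8080, or url-downloader.wikimedia.org:8080 with its file size limit) looks roughly like this from Python. Only the proxy names and ports come from the conversation; the fully-qualified webproxy hostname and the target URL are illustrative assumptions.

    #!/usr/bin/env python3
    # Fetching an external URL from a stat host through the HTTP proxy discussed
    # above. The fully-qualified proxy name is an assumption (the channel just
    # says webproxy:8080); url-downloader.wikimedia.org:8080 is the alternative
    # with a file size limit.
    import urllib.request

    PROXY = 'http://webproxy.eqiad.wmnet:8080'
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({'http': PROXY, 'https': PROXY}))

    with opener.open('https://lists.wikimedia.org/') as resp:
        print(resp.status, len(resp.read()), 'bytes')

Exporting http_proxy/https_proxy in the environment, as suggested in the channel, achieves the same thing for most tools without any code changes.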
[13:55:51] (03PS4) 10Filippo Giunchedi: WIP: report swift containers aggregated stats [puppet] - 10https://gerrit.wikimedia.org/r/240358 [14:04:13] 6operations, 10Analytics-Cluster: php5-curl for stat1002 - https://phabricator.wikimedia.org/T113602#1670683 (10Addshore) 3NEW a:3Ottomata [14:07:01] (03PS1) 10Ottomata: Install php5-curl on statistics compute node (stat100[23]) [puppet] - 10https://gerrit.wikimedia.org/r/240700 (https://phabricator.wikimedia.org/T113602) [14:10:04] addshore, no php_curl package? [14:10:45] same on stat1003 [14:10:58] php5-curl* [14:11:42] (03CR) 10Addshore: [C: 031] Install php5-curl on statistics compute node (stat100[23]) [puppet] - 10https://gerrit.wikimedia.org/r/240700 (https://phabricator.wikimedia.org/T113602) (owner: 10Ottomata) [14:12:11] (03CR) 10Ottomata: [C: 032] Install php5-curl on statistics compute node (stat100[23]) [puppet] - 10https://gerrit.wikimedia.org/r/240700 (https://phabricator.wikimedia.org/T113602) (owner: 10Ottomata) [14:15:35] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 (Hue / Hive) for bmansurov - https://phabricator.wikimedia.org/T113069#1670713 (10bmansurov) Thanks all! [14:23:35] 10Ops-Access-Requests, 6operations: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1670741 (10Papaul) @Coren how do you want me to send you my production key? [14:24:25] Krinkle: do you know where eventlogging.client_errors comes from? [14:24:35] ottomata: Yes, I introduced that last week. [14:24:40] ah! [14:24:42] where? how? [14:25:03] 10Ops-Access-Requests, 6operations: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1670761 (10coren) >>! In T111123#1670741, @Papaul wrote: > @Coren how do you want me to send you my production key? You can simply paste it here on the ticket; that's pretty much... [14:25:04] am searching puppet and eventlogigng code [14:25:08] ottomata: https://github.com/wikimedia/mediawiki-extensions-EventLogging/commit/a019982d48b63c59b94b3e376bfde672f64d9fd9 [14:25:08] https://phabricator.wikimedia.org/T112592 [14:25:22] ottomata: until now, the js client was dropping event without any signal [14:25:29] now/lastweek [14:25:30] OH from client. [14:25:40] Yeah, and that's like 10% for some schemas [14:25:42] ah i was just searching server code [14:25:43] ah hm. [14:25:45] and 90% for other schemas [14:25:53] where the properties are too long but nobody cared to check [14:26:16] coool, [14:26:22] On https://grafana.wikimedia.org/#/dashboard/db/eventlogging I added a section that tracks the trends [14:26:32] And on https://grafana.wikimedia.org/#/dashboard/db/eventloggingschema you can inspect individual ones and their error rate [14:26:42] Krenair: what is maxUrlSize [14:26:42] ? [14:26:44] 1024? [14:26:51] For example, https://grafana.wikimedia.org/#/dashboard/db/eventloggingschema?from=now-12h&var-schema=MultimediaViewerNetworkPerformance has like 50% failure rate for url size [14:26:52] Krinkle, ^ [14:26:59] oops [14:27:02] i mean Krenair sorry :p [14:27:04] ahh [14:27:08] Krinkle: [14:27:14] ding dang auto complete [14:28:03] Krinkle: i ask because bblack just merged and applied this [14:28:04] https://phabricator.wikimedia.org/rOPUP5be836ab0b59314c5e4fe8dd43f4c583135a7c5a [14:28:12] ottomata: It used to be 1000, enforced client-side before it even makes the beacon request. 
Based on the fully formed url (json encoded, url encoded, including path, and hostname, though protocol relative, so it's off by 5 characters but that's fine since server-side has 1024 [14:28:19] https://github.com/wikimedia/mediawiki-extensions-EventLogging/commit/62ecf20edaa [14:28:29] Yeah, it's now 2000 client side [14:28:35] woot :) [14:28:36] great. [14:28:41] I don't know if that's deployed yet. I proposed that bump commit last week, and I see someone merged it now. [14:28:50] Krinkle: ya, i just did [14:28:55] bblack says [14:29:02] "The above commit should be live on all prod + beta varnishes now, raising the varnish log shm limit to ~2K." [14:29:10] cool [14:29:22] ah, ok cool, so you all are in sync [14:29:25] awesome. [14:29:46] Krinkle: we can test it in beta labs if you want now if you have an event over the size [14:30:17] I'm not sure what you mean. [14:33:50] Krinkle: we can test eventlogging on the beta cluster as we have a testing instance there. See: https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/TestingOnBetaCluster [14:34:13] so if you send an event like: https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/TestingOnBetaCluster#How_to_log_a_client-side_event_to_Beta_Cluster_directly [14:34:22] we can see whether it appears on logs [14:35:18] 10Ops-Access-Requests, 6operations: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1670805 (10Papaul) ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCmlt2EID9N7s/YwZ91Ma4t+/M4aQEGx6Qzzwz+pFOC/CNiz43fw89fHsMmrrU+e6T9hI9nTz899EVaxM8CG9kmGtpYOtcpoFkcMhFlnP9JZpuUzmE/DVVDXBFnr... [14:36:14] (03CR) 10BBlack: [C: 031] Repool codfw [dns] - 10https://gerrit.wikimedia.org/r/240698 (owner: 10Faidon Liambotis) [14:36:21] (03PS2) 10Andrew Bogott: contint: migrate ops dependencies to a new class [puppet] - 10https://gerrit.wikimedia.org/r/240659 (owner: 10Hashar) [14:37:03] (03CR) 10Faidon Liambotis: [C: 032] Correct typo for ae3-1022.cr2-eqiad PTR [dns] - 10https://gerrit.wikimedia.org/r/240697 (owner: 10Faidon Liambotis) [14:37:09] (03CR) 10Faidon Liambotis: [C: 032] Repool codfw [dns] - 10https://gerrit.wikimedia.org/r/240698 (owner: 10Faidon Liambotis) [14:37:27] !log repooling codfw [14:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:37:35] Krinkle: i didn't know about that dynamic var substitution thing either, that is useful! [14:38:13] nuria: My own schemas are fine, and I don't have the bandwidth right now to test other schemas. But I'll keep it in mind. [14:38:23] (03CR) 10Andrew Bogott: [C: 032] contint: migrate ops dependencies to a new class [puppet] - 10https://gerrit.wikimedia.org/r/240659 (owner: 10Hashar) [14:39:40] Krinkle: i see that your $schema var template grabs from eventlogging.schema.* [14:39:58] 6operations, 10Beta-Cluster, 6Labs, 10Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1670836 (10Chmarkine) >>! In T50501#1670596, @Lixxx235 wrote: > Chmarkine, there's always StartCom/StartSSL which has free certs, and they're al... [14:39:59] ottomata: Meh, yeah. [14:39:59] i think that wont' work anymore? unless it is being produced some other way...i guess through the client errors thing? [14:40:07] ottomata: well, it's still stored in graphite [14:40:48] yeah, but new schemas won't show up there? 
and i dunno much about how graphite keeps stuff [14:40:52] it's not like it queries statsd and tries to infer magically what properties may be used in the near future. All graphite has is past data. [14:41:01] Yeah, but I'm not sure where else to get this list [14:41:09] the one for client errors is under a different subproperty [14:41:12] yeah, am trying out of kafka, but will have to regex it i guess [14:41:16] trying now [14:41:17] and only contains ones that have errors, not nearly complete [14:41:23] Ah yeah, you can regex it [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150924T1500). Please do the needful. [15:00:04] kart_ jzerebecki irc-nickname: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:15] here [15:00:17] o/ [15:00:35] who is irc-nickname :) [15:00:56] jouncebot only filters that if it is the first one [15:01:00] kart_: the template [15:01:27] we changed the example to use the ircnick template because people kept forgetting about it and just writing their name manually [15:01:39] now the bot tries poke the example user :( [15:02:19] kk, I can SWAT, jzerebecki I'll get yours merged so you can do composer updates. [15:02:57] thcipriani: thx it https://gerrit.wikimedia.org/r/#/c/240680/ needs a force merge as core wmf22 qunit tests are broken :( [15:03:40] Krenair: That example user....always bringing work to swat. [15:04:13] could put the example in a comment I guess [15:04:40] 1.22 doesn't run anywhere [15:04:44] wmf22 [15:04:44] does it? [15:04:57] think wikidata is still on wmf22? [15:04:58] Krinkle: wikidata last branched at wmf22 [15:05:14] Can wikidata just play like a normal extension? [15:05:17] so the core wmf24 branch is pointing to wikidata origin/wmf/wmf22 [15:05:29] (03PS1) 10Faidon Liambotis: mail: add wiki-mail-codfw's IPs to WIKI_INTERFACE [puppet] - 10https://gerrit.wikimedia.org/r/240709 [15:05:34] 1.26wmf22* [15:05:47] you're saying we're deploying wmf22 of an extension alongside wmf24 of mediawikki. [15:06:28] Krinkle: yes i created https://phabricator.wikimedia.org/T105638 RFC: Streamlining Composer usage to make wikidata be able to play like a normal extension, it is blocked on adding signature support to composer [15:06:54] (03PS2) 10Faidon Liambotis: mail: add wiki-mail-codfw's IPs to WIKI_INTERFACE [puppet] - 10https://gerrit.wikimedia.org/r/240709 [15:07:03] The simple solution would be to either: cut every week from master like the rest, or: keep as it is, but rename the branch in subsequent weeks so it matches infrastructure expectations. [15:07:14] The current way should never have been accepted. Doesn't make sense. [15:07:18] 6operations, 10RESTBase: enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#1670888 (10mobrovac) >>! In T112648#1645274, @GWicke wrote: > It is solvable, for example by logging to syslog using udp with https://github.com/mcavage/node-bunyan-syslog. Yup, same thing I had in mind. /... 
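To pin down the client-side limit discussed above: the EventLogging client serializes an event to JSON, URL-encodes it into the beacon query string, and drops it (bumping a counter under eventlogging.client_errors) when the resulting URL exceeds maxUrlSize, which the merged change raises from 1000 to 2000 to sit under the ~2K varnishd shm record limit. A rough Python model of that check follows; it is not the extension's actual JavaScript, and the beacon path is a placeholder.

    # Rough model of the client-side size check described above: JSON-encode
    # the event, URL-encode it into the beacon query string, and drop it if the
    # full URL would exceed maxUrlSize. Not the EventLogging extension's real
    # JavaScript; the beacon path is a placeholder.
    import json
    import urllib.parse

    MAX_URL_SIZE = 2000   # raised from 1000 by the change discussed above
    BEACON = '//meta.wikimedia.org/beacon/event'   # placeholder endpoint

    def beacon_url(capsule):
        return BEACON + '?' + urllib.parse.quote(json.dumps(capsule))

    def should_send(capsule):
        if len(beacon_url(capsule)) > MAX_URL_SIZE:
            # the real client bumps a statsd counter along the lines of
            # eventlogging.client_errors.<Schema>.urlSize and drops the event
            return False
        return True

    capsule = {'schema': 'MultimediaViewerNetworkPerformance', 'revision': 1,
               'event': {'url': 'https://upload.wikimedia.org/' + 'x' * 120}}
    print(should_send(capsule), len(beacon_url(capsule)))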
[15:07:33] (03CR) 10Faidon Liambotis: [C: 032 V: 032] mail: add wiki-mail-codfw's IPs to WIKI_INTERFACE [puppet] - 10https://gerrit.wikimedia.org/r/240709 (owner: 10Faidon Liambotis) [15:07:49] the composer stuff shouldn't have anything to do with it [15:08:04] thcipriani: you still need to press submit, otherwise jenkins will just -1 the patch [15:09:08] I've never force-merged before, but I've heard tell of this tripping up zuul, is that the case? [15:09:48] thcipriani: if it is finishing the gate job that you submitted it might block for 5min [15:11:42] jzerebecki: kk, done. [15:19:07] thcipriani: ok done. normal +2 needed when you are ready for the auto submodule update: https://gerrit.wikimedia.org/r/#/c/240711/ [15:19:25] i will afterwards prepare the submodule update for 23 [15:19:28] jzerebecki: kk, lemme get kart_ 's change out, should be pretty quick. [15:19:44] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240638 (https://phabricator.wikimedia.org/T112848) (owner: 10KartikMistry) [15:19:52] thcipriani: Yeah, Zuul will now go into limbo for 10-20 minutes until it figures out the change merged underneath it [15:19:53] (03Merged) 10jenkins-bot: Enable suggestions in ca, en, es, fr, it, ja, tr, ru, zh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240638 (https://phabricator.wikimedia.org/T112848) (owner: 10KartikMistry) [15:21:02] I don't know when wikibase started using out of sequence wmf branches, but regardless of how often they cut from master, they should not be using branches like this. I don't understand why the first deployment after that didn't raise questions. [15:21:09] How long has this been the case? [15:21:22] Krinkle: zuul seems to be fine this time [15:21:30] Krinkle: since years [15:22:57] jzerebecki: It's not fine. But I imagine nobody cares. What's happenign now is that this change is NOT merged inside the zuul cloner. Which means the next change to Wikibase wmf22 will have a state that does not contain this commit. And when other repositories run cross-repo tests for wmf22, it will use wmf22 of wikibase without this commit. [15:23:06] This is problematic in master and active branches [15:23:14] but since its wmf22 it's only affecting itself. [15:23:35] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable suggestions in ca, en, es, fr, it, ja, tr, ru, zh [[gerrit:240638]] (duration: 00m 17s) [15:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:42] ^ kart_ check please. [15:24:05] Krinkle: https://wikitech.wikimedia.org/wiki/How_to_deploy_Wikidata_code ? [15:24:21] Sure [15:24:43] jzerebecki: That's sad, very sad. I'll bring it up next releng meeting, this has to stop. Unless there is reason to, I'm fairly sure there is no justification for this. Again, I'm fine with Wikibase not branching from master as often, that's understandable, and even the composer stuff is okay. But there is no reason not to use current branch names. [15:25:12] It means you're not testing with what is deployed when backporting changes. Running blind. [15:25:35] Krinkle: I imagine we would be fine with our old branch being copied if we don't supply a new one [15:25:54] thcipriani: working, but hit by https://phabricator.wikimedia.org/T112964 [15:26:06] jzerebecki: That sounds like a good compromise. [15:26:12] Krinkle: can you look at https://phabricator.wikimedia.org/T112964 when have time? [15:26:37] blerg. ok. thanks for checking. 
[15:27:36] Krinkle: didn't know about the zuul-cloner problem, that sounds bad... [15:27:53] Yeah, we choose to use Zuul. And Zuul is very strict about standard. [15:27:54] (03PS1) 10coren: Add Papaul (pt1979) to bastiononly [puppet] - 10https://gerrit.wikimedia.org/r/240716 (https://phabricator.wikimedia.org/T111123) [15:28:13] I'm not a fan of it, but it's what we got and until we do otherwise, we have to embrace it. [15:28:18] No force merging, and no non-matching branches. [15:29:25] kart_: OK. Let's continue in -dev? [15:30:22] kart_: #wikimedia-dev [15:32:05] Krinkle: sure [15:40:05] (03PS2) 10coren: Add Papaul (pt1979) to bastiononly [puppet] - 10https://gerrit.wikimedia.org/r/240716 (https://phabricator.wikimedia.org/T111123) [15:41:37] PROBLEM - Disk space on mw1152 is CRITICAL: DISK CRITICAL - free space: /tmp 698 MB (3% inode=99%) [15:45:27] PROBLEM - Disk space on mw1152 is CRITICAL: DISK CRITICAL - free space: /tmp 706 MB (3% inode=99%) [15:45:52] 10Ops-Access-Requests, 6operations: RESTBase Admin access on aqs1001, aqs1002, and aqs1003 for Joseph and Dan - https://phabricator.wikimedia.org/T113416#1671032 (10kevinator) approved for @JAllemandou and @Milimetric [15:46:42] !log thcipriani@tin Synchronized php-1.26wmf24/extensions/Wikidata: SWAT: Do not filter affected pages by namespace [[gerrit:240711]] (duration: 00m 26s) [15:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:46:51] ^ jzerebecki check please [15:49:35] thcipriani: submodule update for 23: https://gerrit.wikimedia.org/r/#/c/240727/ [15:49:43] jzerebecki: kk [15:49:59] wmf24 look good? [15:50:09] (03CR) 10coren: [C: 032] Add Papaul (pt1979) to bastiononly [puppet] - 10https://gerrit.wikimedia.org/r/240716 (https://phabricator.wikimedia.org/T111123) (owner: 10coren) [15:50:41] thcipriani: still checking [15:50:47] its complicated [15:50:50] okie doke. [15:54:13] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1671050 (10coren) 5Open>3Resolved Pushed to production. [15:58:05] thcipriani: it needs to be deployed for 23 before I can test. sorry for the confusion [15:58:27] jzerebecki: looks like one of the tests failed for wmf23 [15:58:33] https://integration.wikimedia.org/ci/job/mediawiki-phpunit-hhvm-composer/134/console [16:00:04] RobH bblack: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150924T1600). [16:00:04] irc-nickname: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:01:11] !log nothing on puppet swat window, easiest swat ever. [16:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:01:30] yuvipanda: i think this means your project here is a success =] [16:03:50] thcipriani: failure unrelated to the patch, previous run on wmf23 also had that error, also that job is not relevant for our cluster as it is testing with composer instead of vendor. the equivalent jobs using vendor passed. somone changed vendor but not core composer.json similar to https://phabricator.wikimedia.org/T113360 . [16:04:37] PROBLEM - Disk space on mw1152 is CRITICAL: DISK CRITICAL - free space: /tmp 337 MB (1% inode=99%) [16:10:26] jzerebecki: hmm, yeah, I see that all the previous wmf23 stuff was forced. For whatever reason I don't have a submit button here. 
[16:10:58] thcipriani: you need to remove jenkins-bot from reviewers first [16:11:09] thcipriani: oh and v+2 [16:15:46] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [16:15:50] eh? [16:16:26] seems to have already gone back down, according to gdash [16:17:29] !log thcipriani@tin Synchronized php-1.26wmf23/extensions/Wikidata: SWAT: Do not filter affected pages by namespace [[gerrit:240727]] (duration: 00m 26s) [16:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:17:36] ^ jzerebecki check please. [16:20:29] thcipriani: works. that fixed dispatching of changes to abitrary kittens for user pages ;) [16:20:32] thx [16:24:57] jzerebecki: yw! glad everything is working as expected. [16:27:06] (03PS1) 10Dzahn: annual: redirect 2007-2013 URLs to foundation wiki [puppet] - 10https://gerrit.wikimedia.org/r/240735 (https://phabricator.wikimedia.org/T113113) [16:28:57] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:29:06] 6operations, 10ops-codfw: wipe working spare disk in codfw - https://phabricator.wikimedia.org/T112783#1671169 (10RobH) Do you have a spare server not in use using 3.5" disks? If so, put it in here and assign to me for review and we can likely use it. [16:30:39] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Cassandra inter-node encryption (TLS) - https://phabricator.wikimedia.org/T108953#1671172 (10Eevans) [16:36:12] 6operations, 10RESTBase, 10RESTBase-Cassandra: use non-default credentials when authenticating to Cassandra - https://phabricator.wikimedia.org/T92590#1671198 (10Eevans) Created {T113622}, to track the remaining issue. [16:36:48] 6operations, 10RESTBase, 10RESTBase-Cassandra: secure Cassandra/RESTBase cluster - https://phabricator.wikimedia.org/T94329#1671209 (10Eevans) [16:38:39] legoktm: you there? [16:38:44] Steinsplitter: hi [16:39:08] legoktm: user with 182,648 asked for rename (likes to use his real name) [16:40:11] (03CR) 10Southparkfan: [C: 04-1] annual: redirect 2007-2013 URLs to foundation wiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/240735 (https://phabricator.wikimedia.org/T113113) (owner: 10Dzahn) [16:40:12] Steinsplitter: ok, can we do it in 10 minutes? [16:40:19] sure , thanks :) [16:40:36] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [500.0] [16:42:33] (03PS2) 10Dzahn: annual: redirect 2007-2013 URLs to foundation wiki [puppet] - 10https://gerrit.wikimedia.org/r/240735 (https://phabricator.wikimedia.org/T113113) [16:42:35] (03CR) 10Dzahn: annual: redirect 2007-2013 URLs to foundation wiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/240735 (https://phabricator.wikimedia.org/T113113) (owner: 10Dzahn) [16:43:47] (03CR) 10Southparkfan: [C: 031] annual: redirect 2007-2013 URLs to foundation wiki [puppet] - 10https://gerrit.wikimedia.org/r/240735 (https://phabricator.wikimedia.org/T113113) (owner: 10Dzahn) [16:48:17] Steinsplitter: what are the old and new usernames? [16:48:32] old: https://meta.wikimedia.org/wiki/Special:CentralAuth/Agamitsudo [16:48:51] and Benoît Prieur is the new one [16:50:26] Steinsplitter: ok, I'm ready :) [16:50:38] ok, then i start with rename :) [16:53:46] legoktm: thanks, it is running (https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Beno%C3%AEt_Prieur). Will ping you if there are problems. 
[16:53:47] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:53:54] 6operations, 10MediaWiki-extensions-ZeroPortal, 10Traffic, 6Zero: zerofetcher in production is getting throttled for API logins - https://phabricator.wikimedia.org/T111045#1671323 (10Anomie) >>! In T111045#1669413, @Yurik wrote: > Do we know what is actually doing the rate check and blocking? Is that a bac... [16:54:04] :) [16:54:16] :) [16:59:02] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic: Pharicator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#1671351 (10mmodell) [16:59:07] 7Blocked-on-Operations, 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1671352 (10mmodell) [17:02:51] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic: Pharicator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#1671389 (10greg) [17:02:56] 7Blocked-on-Operations, 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1671388 (10greg) [17:03:19] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic: Pharicator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#1645318 (10greg) [17:06:29] 7Blocked-on-Operations, 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1671428 (10greg) [17:06:53] 6operations, 10ops-eqiad: Wipe and disconnect mw1031 - https://phabricator.wikimedia.org/T113283#1671429 (10Cmjohnson) 5Open>3Resolved Wiped and disconnected. [17:06:58] 7Blocked-on-Operations, 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1314620 (10greg) (updated description to just talk about exposing ssh, since we split off websockets, which we still want, plzkthxbai) [17:07:19] PROBLEM - Disk space on mw1152 is CRITICAL: DISK CRITICAL - free space: /tmp 0 MB (0% inode=99%) [17:08:30] 6operations, 10Deployment-Systems, 6Performance-Team, 6Release-Engineering-Team, 7HHVM: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#1671438 (10mmodell) >>! In T103886#1402775, @faidon wrote: >> Iterate on the graceful restart proced... [17:09:11] !log powering down for the last time es1001 - es1010 [17:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:10:00] <_joe_> oh the videoscaler, damn [17:10:48] <_joe_> !log cleaning up /tmp on mw1152 [17:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:11:08] 6operations, 10ops-eqiad: Decommission es1001-es1010 - https://phabricator.wikimedia.org/T113080#1671452 (10Cmjohnson) 5Open>3Resolved @jcrespo, I was going to remove dns entries and wipe but I see you -1 you're own dns commit. Let me know when it's safe to wipe. 
[17:11:16] RECOVERY - Disk space on mw1152 is OK: DISK OK [17:11:58] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/240578 (https://phabricator.wikimedia.org/T108613) (owner: 10Eevans) [17:12:36] (03CR) 10Ori.livneh: [C: 031] improve XFF/XFP/XRIP code in common VCL [puppet] - 10https://gerrit.wikimedia.org/r/240582 (owner: 10BBlack) [17:14:28] PROBLEM - Host cp1046 is DOWN: PING CRITICAL - Packet loss = 100% [17:14:38] 6operations, 10ops-eqiad: Decommission es1001-es1010 - https://phabricator.wikimedia.org/T113080#1671477 (10Cmjohnson) 5Resolved>3Open [17:16:36] PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [17:16:37] PROBLEM - IPsec on cp3015 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [17:16:57] PROBLEM - IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [17:17:26] PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [17:17:27] PROBLEM - IPsec on cp3018 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [17:17:27] PROBLEM - IPsec on cp3016 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [17:17:37] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [17:17:46] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [17:17:46] PROBLEM - IPsec on cp3017 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [17:17:47] PROBLEM - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [17:18:17] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [17:18:17] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [17:20:35] (03PS3) 10Filippo Giunchedi: WIP: configure RESTBase for codfw datacenter [puppet] - 10https://gerrit.wikimedia.org/r/240578 (https://phabricator.wikimedia.org/T108613) (owner: 10Eevans) [17:22:23] (03PS4) 10Filippo Giunchedi: configure RESTBase for codfw datacenter [puppet] - 10https://gerrit.wikimedia.org/r/240578 (https://phabricator.wikimedia.org/T108613) (owner: 10Eevans) [17:23:05] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] configure RESTBase for codfw datacenter [puppet] - 10https://gerrit.wikimedia.org/r/240578 (https://phabricator.wikimedia.org/T108613) (owner: 10Eevans) [17:24:21] bblack: you workign on the ipsec stuff? [17:24:28] or those valid issues? 
[17:25:55] some of those are new / not in service I think [17:26:00] !log bounce restbase on restbase1002, apply new datacenter config [17:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:26:15] I wonder if it's all related to that (new boxes that are not jiving strongswan wise) [17:26:27] chasemp: plus one just fell over and then that happened so not sure if related [17:26:41] cp1046 went down but maybe unrelated [17:27:20] (03PS3) 10Ori.livneh: Update mod_status configuration [puppet] - 10https://gerrit.wikimedia.org/r/239998 [17:27:37] robh: that box is def not taking traffic I believe [17:28:17] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 8.33% of data above the critical threshold [500.0] [17:28:24] it's one of the new jessie hosts not cut over yet [17:28:25] it looks like all the other boxes are complaining of not being able to connect to 1046 [17:28:35] (03CR) 10jenkins-bot: [V: 04-1] Update mod_status configuration [puppet] - 10https://gerrit.wikimedia.org/r/239998 (owner: 10Ori.livneh) [17:28:39] in eqiad I believe it's still just lvs1001-1006 [17:28:41] the "real" issue is just https://www.mixcloud.com/notifications_unsubscribe/?notice=weekly_update&user=greggrossmeier&_sig=C-qKD7THRJwpT3h0IE2KgNB2O90 [17:28:42] in service [17:28:44] gah [17:28:47] stupid copy paste [17:28:54] PROBLEM - Host cp1046 is DOWN: PING CRITICAL - Packet loss = 100% [17:29:22] (now you can also unsub me from mixcloud emails, and you know my nickname, oh well) [17:29:41] if you say so, mr. "greggrossmeier" [17:29:51] if that is your real name [17:29:51] if that is even your real name! [17:29:54] haha :) [17:29:57] nice [17:30:02] I'm not too creative with my usernames anymore, I stopped after highschool/undergrad [17:30:06] damn it i was convinced ori and i would have the exact same syntax [17:30:08] PROBLEM - Restbase root url on restbase1002 is CRITICAL: Connection refused [17:30:11] i should have gone more dramatic. [17:30:19] you mean after you killed the real greg and assumed his identity [17:30:27] one layer of obfuscation is probably sufficient here [17:30:31] someone pull on his beard, it may be fake! [17:30:55] if it isn't fake, be prepared for a punching, you yanked on a man's beard [17:31:18] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [17:31:26] only Rowan gets away with that [17:31:43] ok, now restbase? [17:31:51] maybe that's a mobile cache node actually hieradata/common/cache/mobile.yaml: - 'cp1046.eqiad.wmnet' [17:31:52] hm [17:32:42] gwicke: mobrovac ^ re restbase alert [17:32:59] alerts [17:33:13] (03PS4) 10Ori.livneh: Update mod_status configuration [puppet] - 10https://gerrit.wikimedia.org/r/239998 [17:33:17] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [17:34:07] RECOVERY - Restbase root url on restbase1002 is OK: HTTP OK: HTTP/1.1 200 - 15118 bytes in 0.010 second response time [17:34:31] gwicke: morebots nvm? it recovered [17:34:45] mobrovac: nvm?
it recovered [17:34:48] yeah restbase1002 is me ^ [17:34:51] greg-g: all good [17:34:58] sorry, I'll go back to my real job [17:35:12] greg-g: np, thanks for looking :)) [17:35:45] !log powercycling cp1046 at mgmt as I can't ssh in and it seems like it should be up [17:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:37:17] (03PS5) 10Ori.livneh: Update mod_status configuration [puppet] - 10https://gerrit.wikimedia.org/r/239998 [17:38:37] RECOVERY - IPsec on cp2003 is OK: Strongswan OK - 8 ESP OK [17:38:38] RECOVERY - IPsec on cp3018 is OK: Strongswan OK - 8 ESP OK [17:38:38] RECOVERY - IPsec on cp3016 is OK: Strongswan OK - 8 ESP OK [17:38:47] RECOVERY - Host cp1046 is UP: PING OK - Packet loss = 0%, RTA = 1.38 ms [17:38:56] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 8 ESP OK [17:38:57] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - 8 ESP OK [17:38:57] RECOVERY - IPsec on cp2009 is OK: Strongswan OK - 8 ESP OK [17:38:58] RECOVERY - IPsec on cp3017 is OK: Strongswan OK - 8 ESP OK [17:39:36] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - 8 ESP OK [17:39:37] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - 8 ESP OK [17:39:38] RECOVERY - IPsec on cp2015 is OK: Strongswan OK - 8 ESP OK [17:39:46] RECOVERY - IPsec on cp3015 is OK: Strongswan OK - 8 ESP OK [17:40:06] RECOVERY - IPsec on cp2021 is OK: Strongswan OK - 8 ESP OK [17:41:57] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:43:04] robh: fwiw I think cp1046 is a prod host and part of the mobile cache, and lvs1001 sees it as up again and is sending it traffic [17:43:13] but it was a real problem for sure that it became unresponsive [17:43:30] yea just unrelated to the others but still not good [17:43:45] it's not unheard of for a random lockup [17:43:52] but since you logged it, if it happens again we'll have a record [17:44:08] depending on the crash the logs may be useless [17:45:03] Coren: ok if I move that ironic check-in to an hour later? [17:45:13] andrewbogott: Sure. [17:46:07] robh: 'console com2' just never returned so it's very odd [17:46:25] yea lockup is like that [17:46:45] serial needs a keypress to refresh and if it's hard locked it won't do it on redirection [17:46:54] so unlike a crash cart you just don't get to see what was on the screen. [17:47:01] ahh [17:50:40] (03PS6) 10Ori.livneh: Update mod_status configuration [puppet] - 10https://gerrit.wikimedia.org/r/239998 [17:50:48] (03CR) 10Ori.livneh: [C: 032 V: 032] Update mod_status configuration [puppet] - 10https://gerrit.wikimedia.org/r/239998 (owner: 10Ori.livneh) [17:52:29] Krenair: You cherry-picked the realm fix to beta, right?
[17:52:54] Oh yeah, I remember seeing it there yesterday [17:53:06] (03CR) 10Chad: [C: 032] Multiversion MWRealm getRealmSpecificFilename: Fix support for filenames without an extension but with full stops in the full path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240612 (https://phabricator.wikimedia.org/T112006) (owner: 10Alex Monk) [17:53:35] !log rolling restart restbase in eqiad [17:53:36] (03Merged) 10jenkins-bot: Multiversion MWRealm getRealmSpecificFilename: Fix support for filenames without an extension but with full stops in the full path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240612 (https://phabricator.wikimedia.org/T112006) (owner: 10Alex Monk) [17:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:58:08] PROBLEM - Restbase root url on restbase1001 is CRITICAL: Connection refused [17:58:18] Figures my ssh craps out right now... [17:59:19] !log Merged Apache config change Ia095457fb. It will refresh the Apache service as it rolls out, causing elevated 503s for the next 20 minutes. [17:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:59:37] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [17:59:45] that's me again ^ [18:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150924T1800). [18:00:52] !log demon@tin Synchronized multiversion/MWRealm.php: (no message) (duration: 00m 17s) [18:00:56] PROBLEM - SSH on cp1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:01:07] ACKNOWLEDGEMENT - Restbase root url on restbase1001 is CRITICAL: Connection refused ori.livneh Filippo performing rolling restart of restbase in eqiad [18:01:08] PROBLEM - puppet last run on cp1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:01:47] PROBLEM - IPsec on cp1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:01:51] ACKNOWLEDGEMENT - Restbase endpoints health on restbase2001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) ori.livneh Filippo performing rolling restart of restbase in eqiad [18:02:17] PROBLEM - RAID on cp1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
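The mod_status change merged above (gerrit 239998, logged as Ia095457fb) is about keeping Apache's /server-status handler off the public site, which is also what the 'Require local' question a bit further down and T113090 are about. A minimal sketch of the kind of Apache 2.4 stanza involved, for orientation only — the file path and exact directives here are assumptions, not copied from the actual patch:

    # illustrative /etc/apache2/mods-available/status.conf
    <IfModule mod_status.c>
        <Location /server-status>
            SetHandler server-status
            # Apache 2.4 syntax: only requests from the server itself
            # (loopback or its own addresses) may reach this handler.
            Require local
        </Location>
    </IfModule>

With a stanza like this, curl http://localhost/server-status still works from the host itself while external clients get a 403.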
[18:04:05] (03PS3) 10Dzahn: annual: redirect 2007-2013 URLs to foundation wiki [puppet] - 10https://gerrit.wikimedia.org/r/240735 (https://phabricator.wikimedia.org/T113113) [18:04:06] PROBLEM - Host cp1046 is DOWN: PING CRITICAL - Packet loss = 100% [18:04:17] RECOVERY - Disk space on labstore1002 is OK: DISK OK [18:05:06] (03PS4) 10Dzahn: annual: redirect 2007-2013 URLs to foundation wiki [puppet] - 10https://gerrit.wikimedia.org/r/240735 (https://phabricator.wikimedia.org/T113113) [18:05:43] what's up with cp1046, that doesnt belong to the others, does it [18:06:15] !log depooling cp1046, stability issues [18:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:07:06] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [18:07:06] ori: thanks for the ack! [18:07:27] PROBLEM - IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [18:07:44] (03PS1) 10Ori.livneh: Partial revert of I94c343d [puppet] - 10https://gerrit.wikimedia.org/r/240771 [18:07:47] mutante: ^ [18:07:56] PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [18:07:58] PROBLEM - IPsec on cp3018 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [18:08:06] PROBLEM - IPsec on cp3016 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [18:08:07] alright, there were no more mgmt sessions available [18:08:16] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Puppet has 1 failures [18:08:17] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [18:08:17] PROBLEM - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [18:08:17] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [18:08:27] PROBLEM - IPsec on cp3017 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [18:08:56] PROBLEM - puppet last run on cp2005 is CRITICAL: CRITICAL: Puppet has 1 failures [18:08:57] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [18:08:58] PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [18:09:07] PROBLEM - IPsec on cp3015 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6 [18:09:13] ori: ah :) thank you [18:09:28] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy [18:09:37] PROBLEM - puppet last run on mw1088 is CRITICAL: CRITICAL: puppet fail [18:09:38] PROBLEM - puppet last run on mw2014 is CRITICAL: CRITICAL: Puppet has 1 failures [18:10:07] RECOVERY - Restbase root url on restbase1001 is OK: HTTP OK: HTTP/1.1 200 - 15118 bytes in 0.020 second response time [18:10:26] PROBLEM - puppet last run on mw1187 is CRITICAL: CRITICAL: Puppet has 1 failures [18:10:27] PROBLEM - puppet last run on mw2148 is CRITICAL: CRITICAL: Puppet has 1 failures [18:10:27] PROBLEM - puppet last run on mw2154 is CRITICAL: CRITICAL: Puppet has 1 failures [18:10:27] PROBLEM - puppet last run on mw1209 is CRITICAL: CRITICAL: Puppet has 1 failures [18:10:27] PROBLEM - puppet last run on mw2121 is CRITICAL: CRITICAL: Puppet has 1 failures [18:10:37] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures [18:10:37] RECOVERY - SSH on cp1046 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [18:10:38] 
RECOVERY - Host cp1046 is UP: PING OK - Packet loss = 0%, RTA = 3.13 ms [18:10:47] RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 31 minutes ago with 0 failures [18:10:48] PROBLEM - puppet last run on mw2213 is CRITICAL: CRITICAL: Puppet has 2 failures [18:10:56] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - 8 ESP OK [18:10:57] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - 8 ESP OK [18:10:57] RECOVERY - IPsec on cp2015 is OK: Strongswan OK - 8 ESP OK [18:11:01] 6operations, 10Traffic: cp1046 is crashing and becoming unresponsive - https://phabricator.wikimedia.org/T113639#1671942 (10chasemp) 3NEW [18:11:06] RECOVERY - IPsec on cp3015 is OK: Strongswan OK - 8 ESP OK [18:11:18] RECOVERY - IPsec on cp2021 is OK: Strongswan OK - 8 ESP OK [18:11:27] RECOVERY - IPsec on cp1046 is OK: Strongswan OK - 24 ESP OK [18:11:28] 6operations, 10netops: Upgrade JunOS on cr1/cr2-codfw - https://phabricator.wikimedia.org/T113640#1671953 (10faidon) 3NEW [18:11:46] PROBLEM - puppet last run on mw2185 is CRITICAL: CRITICAL: Puppet has 1 failures [18:11:47] RECOVERY - IPsec on cp2003 is OK: Strongswan OK - 8 ESP OK [18:11:57] RECOVERY - RAID on cp1046 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [18:11:58] RECOVERY - IPsec on cp3018 is OK: Strongswan OK - 8 ESP OK [18:11:58] RECOVERY - IPsec on cp3016 is OK: Strongswan OK - 8 ESP OK [18:12:06] PROBLEM - puppet last run on mw2001 is CRITICAL: CRITICAL: Puppet has 3 failures [18:12:07] PROBLEM - puppet last run on mw2007 is CRITICAL: CRITICAL: Puppet has 1 failures [18:12:16] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 8 ESP OK [18:12:16] RECOVERY - IPsec on cp2009 is OK: Strongswan OK - 8 ESP OK [18:12:17] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - 8 ESP OK [18:12:17] RECOVERY - IPsec on cp3017 is OK: Strongswan OK - 8 ESP OK [18:14:03] (03PS1) 1020after4: wikipedia wikis to 1.26wmf24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240774 [18:14:39] ori: but it is still 'Require local'? how did you limit it to the loopback interface [18:14:57] hey yuvipanda, puppet and gmond are running on hafnium now, yes? [18:15:06] 6operations, 6Security, 10Wikimedia-Apache-configuration, 7Privacy: Apache 2.4 exposes server status page by default? - https://phabricator.wikimedia.org/T113090#1671987 (10ori) [18:15:09] ottomata: no [18:15:10] (03CR) 1020after4: [C: 032] wikipedia wikis to 1.26wmf24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240774 (owner: 1020after4) [18:15:13] ottomata: well I haven't touched i [18:15:14] t [18:15:18] (03Merged) 10jenkins-bot: wikipedia wikis to 1.26wmf24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240774 (owner: 1020after4) [18:16:17] !log restart restbase on restbase1003 [18:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:18:05] ja someone reenabled them :) [18:18:08] (03CR) 10Dzahn: [C: 032] annual: redirect 2007-2013 URLs to foundation wiki [puppet] - 10https://gerrit.wikimedia.org/r/240735 (https://phabricator.wikimedia.org/T113113) (owner: 10Dzahn) [18:18:34] yuvipanda: I did just find some eventlogging graphite conusmer process there that upstart was trying to start over and over again, but no config file existed, so it just logged errors and died [18:18:52] !log restart restbase on restbase1004 [18:18:55] i stopped that. 
not sure if that was the cause of the gmond issue, not sure how it would be [18:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:19:08] ottomata: ok. I've no idea about any of this so it's all going in one end and out the other - you should talk to Krinkle or ori about hafnium and not me :) [18:19:26] I was only the messenger etc, looking because Krinkle didn't have hafnium access [18:19:30] !log restart restbase on restbase1005 [18:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:19:43] heheh ok [18:19:53] i will respond to email thread [18:20:09] ottomata: Which one didn't have upstart? [18:20:17] PROBLEM - Restbase root url on restbase1003 is CRITICAL: Connection refused [18:20:44] !log moved oauthadmin group from User:Yuvipanda@metawiki to User:YuviPanda@metawiki [18:20:46] 6operations, 10Annual-Report, 5Patch-For-Review: redirect to older annual reports from annual.wikimedia.org - https://phabricator.wikimedia.org/T113113#1672023 (10Dzahn) terbium:~] $ apache-fast-test annual.url bromine.eqiad.wmnet testing 22 urls on 1 servers, totalling 22 requests spawning threads.. https... [18:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:20:50] 6operations, 10Annual-Report, 5Patch-For-Review: redirect to older annual reports from annual.wikimedia.org - https://phabricator.wikimedia.org/T113113#1672024 (10Dzahn) a:3Dzahn [18:20:57] 6operations, 10Annual-Report, 5Patch-For-Review: redirect to older annual reports from annual.wikimedia.org - https://phabricator.wikimedia.org/T113113#1672025 (10Dzahn) 5Open>3Resolved [18:21:47] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedia wikis to 1.26wmf24 [18:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:22:07] RECOVERY - Restbase root url on restbase1003 is OK: HTTP OK: HTTP/1.1 200 - 15118 bytes in 0.011 second response time [18:23:16] 6operations, 10Annual-Report: redirect to older annual reports from annual.wikimedia.org - https://phabricator.wikimedia.org/T113113#1672029 (10Dzahn) [18:24:04] Krinkle: it didn't have an /etc/eventlogging.d config [18:24:05] on purpose [18:24:10] it was removed [18:24:18] but, upstart i guess still had it registered somehow [18:24:33] 997 17109 0.0 0.0 52412 12232 ? Rs 18:15 0:00 /usr/bin/python -OO /usr/local/bin/eventlogging-consumer @/etc/eventlogging.d/consumers/graphite [18:24:37] ottomata: I mean, which consumer? Did it say what it was doing or some clue as to which proc... [18:24:39] eventlogging-consumer: error: [Errno 2] No such file or directory: '/etc/eventlogging.d/consumers/graphite' [18:24:41] Hmm [18:24:58] # eventloggingctl status [18:24:58] forwarder stop/waiting [18:24:58] multiplexer stop/waiting [18:24:58] processor stop/waiting [18:24:58] reporter stop/waiting [18:24:59] consumer graphite start/post-stop 17753 [18:25:01] so I did [18:25:05] eventloggingctl stop [18:25:07] Oh, an actual consumer. [18:25:10] yea [18:25:15] not a kafka subscriber [18:25:17] right [18:25:18] nor zmq [18:25:23] interesting [18:25:26] No idea what that'd be for. [18:25:30] it used to use zmq when zmq was a thing [18:25:38] yeah, um [18:25:45] Right, so it was logging the counts to statsd from there? [18:25:52] i think so, not sure if that was the statsd one...
[18:26:00] I thought eventlogging server main had a config option to report to statsd [18:26:09] 6operations, 10Annual-Report, 10Traffic, 5Patch-For-Review: move annual report from zirconium to bromine - https://phabricator.wikimedia.org/T104936#1672043 (10Dzahn) [18:26:11] looking up in puppet [18:26:17] Krinkle: naw [18:26:31] all the statsd reporting was done by consuming the whole stream [18:26:35] and then it was parsed [18:26:41] there is a statsd handler in eventlogging code [18:26:43] but it is just a consumer [18:27:17] looking in puppet history to verify if this was that. [18:27:56] yes. that was it [18:28:03] - eventlogging::service::consumer { 'graphite': [18:28:03] - input => "tcp://${multiplexer_host}:8600", [18:28:03] - output => "statsd://${statsd_host}:8125", [18:28:33] Hm.. so the logic for sending to statsd is a standard config option for an eventlogging consumer [18:28:38] no [18:28:46] it is the output of an individual consumer [18:28:49] so that was a process [18:28:54] that subscribed to the full zmq stream [18:28:54] but it wasn't used on the main eventlogging server, instead we ran a no-op consumer on hafnium that just emits to statsd [18:29:01] processed every event, and counted schemas, and sent to statsd [18:29:06] Hm.. ok [18:29:07] yes, that's right [18:29:25] ottomata: What kind of consumers do we have now? [18:29:41] We have ones that write to sql and to disk, right? [18:29:46] Krinkle: https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/server/eventlogging/handlers.py#L352 [18:29:52] And the rest are all subscribers to kafka or zmq? [18:30:11] ottomata: I mean, which ones we run in prod at the moment. [18:30:50] ja, Krinkle the only prod eventlogging consumers right now are mysql and log files [18:31:00] but, we also run camus which imports from kafka into hdfs [18:31:10] Ah, yeah, so the statsd handling is part of eventlogging, it wasn't done as part of a subscriber (e.g. the way we have a subscriber on hafnium for navtiming, mwjsdeprecate etc.), but an option directly for consumers. [18:31:43] uHhhHhHh, the statsd handler is built into eventlogging code as a writer [18:31:51] Yep [18:31:52] you can run a consumer that writes to statsd using that handler [18:31:57] Yeah [18:32:12] I'm not recommending that we do, but we could've done it as part of an existing consumer, right? [18:32:24] Or does a consumer only do one writer? [18:33:01] a consumer only does one writer, buuuut it could do more if it were coded to do so [18:33:21] Krinkle: the concepts of forwarder and consumer are a little overlapping in eventlogging [18:33:29] 6operations, 10Traffic: cp1046 is crashing and becoming unresponsive - https://phabricator.wikimedia.org/T113639#1672079 (10chasemp) I don't see anything damning in dmesg and I can't find anything that is narrowing down a hardware issue. The best though i have is MEM used seems to have spiked before the last... [18:33:32] the only real difference is that forwarder appends the raw=True to its readers and writers [18:33:50] so it doesn't parse the json into an object [18:34:07] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [18:34:18] but, um, to answer your question, we probably could emit schema event counts to statsd from the processor [18:34:20] as an option [18:34:26] RECOVERY - puppet last run on cp2005 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [18:34:30] rather than a specific output of a consumer.
[18:34:43] because each processor knows which schema it has validated [18:34:44] ottomata: Yeah, I'm also not entirely clear on the current flow. So we have requests -> varnishkafka -> eventlogging party of parsing and validation -> consumers (mysql, fs, and also topics back to kafka) -> subscribers to kafka [18:34:49] Is that accurate? [18:35:17] RECOVERY - puppet last run on mw2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:35:26] yeah, the main eventlogging already uses statsd to report total validation and raw counts, right? Or does that go via kafka. [18:35:26] RECOVERY - puppet last run on mw2185 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:35:33] ja Krinkle https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/Architecture [18:35:47] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:35:57] RECOVERY - puppet last run on mw1187 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:35:58] RECOVERY - puppet last run on mw2148 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:35:58] RECOVERY - puppet last run on mw1209 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:36:06] RECOVERY - puppet last run on mw2154 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:36:06] RECOVERY - puppet last run on mw2121 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:36:34] i should add in there the legacy zmq forwarder too, although i guess we want to remove that :) [18:36:36] RECOVERY - puppet last run on mw2213 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:37:07] RECOVERY - puppet last run on mw1088 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:37:38] RECOVERY - puppet last run on mw2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:37:38] RECOVERY - puppet last run on mw2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:37:48] but basically, Krinkle, raw unvalidated json strings come into kafka from varnishkafka (client side), then N eventlogging-processors consume that and validate. Then the processor outputs to schema based topics in kafka AND also to a mixed topic that contains events from all (most) schemas. [18:38:25] this eventlogging-valid-mixed topic is useable just as the ZMQ port 8600 was [18:38:43] so we didn't have to change anything with the MySQL or files consumer, other than telling them to consume from Kafka instead of zmq [18:38:58] but, now that there are schema based topics in Kafka, like eventlogging_NavigationTiming [18:39:22] one can consume directly from Kafka using eventlogging or some other Kafka client and get those type of events only [18:39:38] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1672098 (10Eevans) The list of datacenters in RESTBase config has been expanded to include codfw, replication has been updated, and data is now being repli... 
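Since every validated schema now has its own Kafka topic (e.g. eventlogging_NavigationTiming) alongside the mixed topic the MySQL and log-file consumers read, any Kafka client can follow a single schema without re-parsing the whole stream, as described above. A rough sketch of that with the kafka-python client — the broker address, consumer group and the tallying here are illustrative assumptions, not the production setup:

    # Sketch: tail one schema's topic and tally events by schema revision.
    import json
    from collections import Counter
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        'eventlogging_NavigationTiming',                   # one topic per validated schema
        bootstrap_servers=['kafka1012.eqiad.wmnet:9092'],  # assumed broker address
        group_id='navtiming-sketch',                       # assumed consumer group
    )

    counts = Counter()
    for message in consumer:
        event = json.loads(message.value.decode('utf-8'))  # already-validated EventLogging event
        counts[event.get('revision')] += 1
        if sum(counts.values()) % 100 == 0:
            print(dict(counts))

This is the same idea the old statsd consumer relied on, just scoped to one topic instead of the full stream.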
[18:40:16] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 875 [18:40:16] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 878 [18:42:57] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [18:45:06] Krinkle: btw, we can turn the statsd consumer back on if you need it, i just worry about it if/when we increase the throughput in eventlogging [18:45:16] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1174 [18:45:16] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1178 [18:45:25] since it has to consume and parse every event [18:48:00] ottomata: No, I don't think we'll need it. And if we do, we're probably better off not re-parsing it from a consumer but adding it as a feature coupled into eventlogging. I'm not usually a fan of tight coupling, but statsd support seems reasonable for a server library to provide. [18:50:16] RECOVERY - check_mysql on db1008 is OK: Uptime: 5364140 Threads: 1 Questions: 36912342 Slow queries: 36308 Opens: 88576 Flush tables: 2 Open tables: 64 Queries per second avg: 6.881 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [18:50:16] RECOVERY - check_mysql on lutetium is OK: Uptime: 3129266 Threads: 2 Questions: 22677865 Slow queries: 21386 Opens: 41126 Flush tables: 2 Open tables: 64 Queries per second avg: 7.247 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [18:50:23] Krinkle: you might be able to get around the OneMinuteAverage thing if you use count instead [18:50:43] ottomata: "count" ? [18:51:02] yeah it is a raw counts of messages in that topic [18:51:05] so always increasing [18:51:40] maybe that + nonNegativeDerivative and scale(60)? [18:52:28] naww maybe not. 
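For reference, the two ways of getting a per-minute event rate out of Graphite that are being weighed here come out roughly as the following render targets — scale(), nonNegativeDerivative() and scaleToSeconds() are standard Graphite functions, but the kafka metric paths are placeholders rather than the real metric names:

    # events/min from the per-second OneMinuteRate gauge (the comparison mentioned just below)
    scale(kafka.<cluster>.<topic>.MessagesInPerSec.OneMinuteRate, 60)

    # events/min derived from the ever-increasing message counter instead
    scaleToSeconds(nonNegativeDerivative(kafka.<cluster>.<topic>.MessagesInPerSec.count), 60)

The two will not line up exactly, since OneMinuteRate is an exponentially weighted moving average while the derivative is a raw per-interval delta.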
[18:52:29] dunno [18:52:32] just an idea [18:54:52] !log catrope@tin Synchronized php-1.26wmf24/extensions/Flow/: Debugging for FlowFixLinks.php (duration: 00m 20s) [18:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:55:09] (03CR) 10Dzahn: [C: 032] Partial revert of I94c343d [puppet] - 10https://gerrit.wikimedia.org/r/240771 (owner: 10Ori.livneh) [18:55:24] (03PS2) 10Dzahn: Partial revert of I94c343d [puppet] - 10https://gerrit.wikimedia.org/r/240771 (owner: 10Ori.livneh) [18:55:25] I compared OneMinuteRate/scale(60) to the old statsd metric and it matched exactly within a margin of 2 counts [18:55:34] on a scale of 350 events [18:55:35] ah ok, cool [18:55:38] sounds like you got it then, nm [18:55:39] so I'm okay with this for now [18:55:41] k [18:56:16] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:56:57] (03PS3) 10Dzahn: mailman-apache: Partial revert of I94c343d [puppet] - 10https://gerrit.wikimedia.org/r/240771 (owner: 10Ori.livneh) [18:59:19] (03PS4) 10Dzahn: mailman-apache: Partial revert of I94c343d [puppet] - 10https://gerrit.wikimedia.org/r/240771 (https://phabricator.wikimedia.org/T113090) (owner: 10Ori.livneh) [18:59:28] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/240771 (https://phabricator.wikimedia.org/T113090) (owner: 10Ori.livneh) [19:17:41] (03CR) 10Krinkle: [C: 031] Removed ignore_user_abort( true ) line [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240634 (owner: 10Aaron Schulz) [19:22:33] (03PS1) 10Ori.livneh: xenon: add xenon-grep script [puppet] - 10https://gerrit.wikimedia.org/r/240793 [19:22:49] (03CR) 10Ori.livneh: [C: 032 V: 032] xenon: add xenon-grep script [puppet] - 10https://gerrit.wikimedia.org/r/240793 (owner: 10Ori.livneh) [19:31:21] 6operations, 10MediaWiki-extensions-ZeroPortal, 10Traffic, 6Zero: zerofetcher in production is getting throttled for API logins - https://phabricator.wikimedia.org/T111045#1672248 (10BBlack) In this case, we're logging into the same account from many IP addresses (let's say ~100), and each of those IP addr... [19:32:19] 6operations, 10MediaWiki-extensions-ZeroPortal, 10Traffic, 6Zero: zerofetcher in production is getting throttled for API logins - https://phabricator.wikimedia.org/T111045#1672250 (10BBlack) The code doing the fetching, in case that provides any details as to what kind of login it's using, is: https://gith... [19:34:39] !log changing labs route on cr1 and cr2 from 10.68.16.0/22 to 10.68.16.0/21 which matches references, fw setting and manifests/network.pp [19:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:35:09] !log depooled cp1046 from confd, committed pybal depool for LVS as well [19:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:37:38] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: puppet fail [19:41:02] (03PS4) 10Dzahn: annualreport: puppetize git cloning [puppet] - 10https://gerrit.wikimedia.org/r/240606 [19:41:40] (03PS5) 10Dzahn: annualreport: puppetize git cloning [puppet] - 10https://gerrit.wikimedia.org/r/240606 [19:41:57] (03CR) 10Dzahn: [C: 032] annualreport: puppetize git cloning [puppet] - 10https://gerrit.wikimedia.org/r/240606 (owner: 10Dzahn) [19:42:22] I need to request a temporary lift of IP Cap for THIS SATURDAY for the Women of Wikipedia Edit-A-Thon in UC Berkeley. 
For more information, our grant page can be found here: https://meta.wikimedia.org/wiki/Grants:IEG/WOW!_Editing_Group [19:42:37] How can I make this happen ASAP? [19:43:52] Saturday? [19:43:55] Seriously? [19:44:18] it will need the IP addresses and a mediawiki config change [19:44:21] jouncebot: next [19:44:21] In 3 hour(s) and 15 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150924T2300) [19:44:56] Sigh. [19:45:50] tkatoni: can you make a ticket in phabricator? [19:49:12] basically just copy/paste what you said here. we can edit the details like projects etc [19:50:50] What is phabricator? [19:51:36] https://meta.wikimedia.org/wiki/Main_Page -> search box -> "Phabricator" [19:51:40] The IP address I am trying to lift is for Cal Visitor, IPv4 address: 10.105.187.114 [19:51:49] That's an internal IP. [19:53:04] can i not lift an internal IP? I apologize, I am not tech savy, I am simply a humanities cal student running this edit-a-thon [19:53:08] (03CR) 10Dzahn: "maybe separate adding the module and the "apply on fermium" part. could it be useful for other things not on fermium? for labs users?" [puppet] - 10https://gerrit.wikimedia.org/r/231973 (https://phabricator.wikimedia.org/T82576) (owner: 10Ori.livneh) [19:53:32] tkatoni: what ip shows on http://ipchicken.com/ for you [19:53:51] 192.31.105.187 [19:54:26] are you sure everyone at the editathon will be on that ip? [19:54:29] assuming your lab or network is nat'd out to one IP that would be it, you probably want to ask your IT folks if that is the case [19:54:29] tkatoni: are you currently connected via the cal visitor network? [19:54:58] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 9.09% of data above the critical threshold [500.0] [19:55:10] yes, the cal visitor network is what all of the editors will connect to [19:56:04] it's UC Berkeley, maybe we can just do this network, i suppose [19:56:06] NetRange: 192.31.105.0 - 192.31.105.255 [19:56:13] NetName: UCC-PRO-UCB [19:56:42] best to throw that in teh ticket I imagine [19:57:02] yes please! how do I do that? [19:57:34] the edit-a-thon is this saturday (9/26) from 11-am - 2pm pacific time [19:57:49] !log ori@tin Synchronized php-1.26wmf24/extensions/ContentTranslation: d079d5dd71: Updated mediawiki/core Project: mediawiki/extensions/ContentTranslation 8559ee614975f25b71a732ca0fb1bb6d489c9d33 (duration: 00m 18s) [19:57:50] tkatoni: go to phabricator.wikimedia.org and try to login with your normal wiki user [19:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:58:10] tkatoni: if that works, click the plus symbol in the upper right corner and select "task" [19:59:37] in the "title" field put a summary like an email subject, and paste all the rest into the large "description" field, we'll handle it from there [20:02:32] There is no plus symbol. I tried to log in with my wiki account info and it is not letting me. [20:02:54] all i see is a server admin log [20:03:27] server admin log has nothing to do with this [20:03:37] that's just normal bot traffic on this irc channel [20:03:40] tkatoni: phabricator.wikimedia.org [20:04:06] !fileabug [20:04:08] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [20:04:16] !fileabug is https://phabricator.wikimedia.org/maniphest/task/create/ [20:04:16] You are not authorized to perform this, sorry [20:04:22] whattttt. 
D: [20:04:28] I trust: petan|w.*wikimedia/Petrb (2admin), .*@wikimedia/.* (2trusted), .*@mediawiki/.* (2trusted), .*@mediawiki/Catrope (2admin), .*@wikimedia/RobH (2admin), .*@wikimedia/Ryan-lane (2admin), petan!.*@wikimedia/Petrb (2admin), .*@wikimedia/Krinkle (2admin), [20:04:28] @trusted [20:04:32] tkatoni: https://www.mediawiki.org/wiki/Phabricator/Help#Creating_your_account_and_notifications [20:04:34] You are unknown to me :) [20:04:34] @whoami [20:04:36] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:04:38] tkatoni: on the phabricator login page, you need to press the "Login with Mediawiki" button down below the login fields [20:04:54] ok, on it [20:05:12] yeah the main form is for LDAP [20:05:18] hmm [20:05:28] anyone wants to `@trustadd .*@wikipedia/.*` ? [20:07:46] Wrong number of parameters, go fix it - example @trustadd regex (admin|trusted) [20:07:46] @trustadd .*@wikipedia/.* [20:07:53] Successfuly added .*@wikipedia/.* [20:07:53] @trustadd .*@wikipedia/.* trusted [20:09:20] I am logged into Phabricator, about to write in a new task. What should I put to get the ip lifted? [20:09:59] RoanKattouw: @wiktionary rages :) [20:10:21] tkatoni: the date, the location, the link you gave us about it, and preferably the IP range [20:10:25] start time, end time, IP, wikis creation will happen on [20:10:30] and the link [20:11:03] ok, thank you! [20:11:41] oh, and how many people are expected roughly, i guess [20:11:56] 30 [20:12:18] and which usersame should I assign it to? [20:12:38] on duty person is probably best dunno who that is, mutante? [20:12:47] Don't. [20:12:58] It's Wikimedia-site-requests, not operations. [20:13:06] i was last week, today it's coren, but what Krenair said [20:13:15] so i should put down coren? [20:13:19] No. [20:13:28] Yep. Kernair is right, as always. :-) [20:13:34] thank you! [20:14:21] does someone traige Wikimedia-site-requests requests? [20:14:32] any particular project name i should put? [20:15:01] Wikimedia-Site-Requests [20:15:06] tkatoni, Wikimedia-site-requests [20:15:18] ok. [20:15:51] how do I know once the ip ban request has been granted? will you contact me? [20:16:15] you will get notifications from the ticket itself [20:16:19] You will receive email updates about changes to the ticket via your wiki email address I think. 
[20:16:21] when others leave comments or change it [20:23:09] you set an address on register that may or may not be the wiki one, but either way, updates :) [20:26:06] ah, email is separate, okay [20:40:16] (03PS1) 10Alex Monk: Raise account creation throttle for editathon at UC Berkeley [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240880 (https://phabricator.wikimedia.org/T113654) [20:47:16] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 8.33% of data above the critical threshold [500.0] [20:52:47] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:57:09] (03PS1) 10Smalyshev: switch to git-based portal [puppet] - 10https://gerrit.wikimedia.org/r/240888 (https://phabricator.wikimedia.org/T110070) [20:58:14] (03CR) 10Dzahn: [C: 031] "yes please, that's the UC Berkeley network they will be using" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240880 (https://phabricator.wikimedia.org/T113654) (owner: 10Alex Monk) [21:00:31] (03PS2) 10Dzahn: Raise account creation throttle for editathon at UC Berkeley [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240880 (https://phabricator.wikimedia.org/T113654) (owner: 10Alex Monk) [21:06:08] bblack: yt? [21:07:33] Krenair: should we add the change to swat really quick before it starts? [21:08:40] yes [21:08:45] nuria: somewhat [21:11:44] (03Abandoned) 10Dzahn: dbtree: ensure that git clone is latest [puppet] - 10https://gerrit.wikimedia.org/r/240608 (owner: 10Dzahn) [21:14:04] (03CR) 10Dzahn: ":) i can rename it. yea, hmm.. now i'm not sure if we still need it or not" [puppet] - 10https://gerrit.wikimedia.org/r/228137 (owner: 10Dzahn) [21:14:24] mutante, listed [21:14:43] Krenair: :) [21:14:56] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1672662 (10ellery) @Jgreen Does that mean the data in pgheres.bannerimpressions is incorrectly scaled? If not, w... [21:18:35] (03CR) 10Dzahn: "yea, agreed. i don't have the answer to all the comments yet. note that i simply followed the example from https://github.com/librenms/lib" [puppet] - 10https://gerrit.wikimedia.org/r/229299 (https://phabricator.wikimedia.org/T107702) (owner: 10Dzahn) [21:24:21] (03CR) 10Dzahn: librenms - enable LDAP auth (WIP) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/229299 (https://phabricator.wikimedia.org/T107702) (owner: 10Dzahn) [21:24:23] (03PS5) 10Dzahn: librenms - enable LDAP auth (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/229299 (https://phabricator.wikimedia.org/T107702) [21:24:45] !bash ostriches: Driving. 
Shouldn't be on irc lol [21:26:09] thank you [21:26:20] chasemp: did you mean https://tools.wmflabs.org/bash/help :) [21:26:28] https://tools.wmflabs.org/bash/quip/AVABPoSb1oXzWjit5rXg [21:26:34] mutante: no [21:26:50] :) [21:26:59] the lack of confirmation is slightly annoying, but I love bd's bashbot [21:28:48] https://tools.wmflabs.org/bash/quip/AU_JvX6F1oXzWjit5hxZ [21:36:42] (03CR) 10MarcoAurelio: Add $wgMassMessageWikiAliases configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237687 (owner: 10Legoktm) [21:37:54] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1672781 (10atgo) hey @ellery - right now we've got it in Q2, which means we'll hopefully look at it before the ne... [21:41:18] greg-g: what is the practical limit on the "No new features" rule for requesting a SWAT deploy? We have medium-sized tweaks to a feature and its API, no user-facing changes... Also, what is the recommended advanced time to book a SWAT slot? [21:41:24] Thx in advance [21:41:26] :) [21:41:41] (03CR) 10Alex Monk: Add $wgMassMessageWikiAliases configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237687 (owner: 10Legoktm) [21:46:59] (03CR) 10Tim Landscheidt: "recheck" [software] - 10https://gerrit.wikimedia.org/r/205086 (owner: 10Tim Landscheidt) [21:54:59] AndyRussG: people add them right up until they start [21:55:12] greg-g: cool beans! [21:55:20] AndyRussG: and, really it's a "use your best judgement that you'd be able to convince your mother you did tthe right thing" ;) [21:55:31] where "your mother" == me :P [21:57:09] greg-g: heheheh I'll need more data on ur convincibility then ;) [21:57:44] K I'm gonna target Monday morning, looks pretty free so far... [21:59:22] AndyRussG, er, well [21:59:31] Greg is right that people do add things right up until they start [21:59:43] and occasionally a few minutes towards the end of the window [21:59:57] But I think it's preferred to add them as soon as they're ready :) [22:01:47] AndyRussG: plus +111 to what Krenair said :) [22:02:39] That way you don't forget, and we may be able to do some review in advance in some cases [22:02:57] Krenair: greg-g: ah K thanks, makes a lot of sense... I'm just waiting from a performance-team sign-off on a change (that'd be included w/ a few others), then we can hopefully merge stuff and test on the beta cluster tomorrow and over the weekend. I think I shouldn't add to the SWAT anything that isn't merged yet [22:03:23] yeah, SWAT is not for code review, please get it merged to master beforehand [22:03:27] So in theory that'd be booking a space on the SWAT tomorrow once it's merged... I hope [22:03:37] Tomorrow is Friday [22:03:56] Krenair: no I mean, on Friday signing up for MOnday's SWAT [22:04:03] oh I see [22:04:27] Krenair: and yes, I mean, it wouldn't be for CR, just wanting to book the slot ahead of time, but it sounds like booking on Friday for a Monday SWAT is OK [22:04:57] apologies for bugging you both w/ this [22:05:33] np [22:07:24] AndyRussG: better to talk than not! :) [22:09:46] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1672900 (10Jgreen) >>! In T97676#1672662, @ellery wrote: > @Jgreen Does that mean the data in pgheres.bannerimpr... 
[22:10:50] hi, is https://phabricator.wikimedia.org/diffusion/ODBE/repository/master/ being maintained? [22:10:56] It build depends on things that are not in newer versions of ubuntu. We'll probably fix it for us, and can contribute the changes back if you want. [22:12:20] repo doesn't look very active [22:12:36] akosiaris contributed at one point, I would ask ottomata but it doesn't seem like he's around [22:12:37] PROBLEM - puppet last run on mw2165 is CRITICAL: CRITICAL: puppet fail [22:13:01] I noticed it hadn't been updated for a while, but I don't know if that's because it's no longer being user, or because it hasn't needed to be. [22:13:42] https://gerrit.wikimedia.org/r/#/q/project:operations/debs/kafka,n,z [22:13:50] there's a bunch of things there which aren't in phab's copy for some reason [22:14:23] ooh, that's got 8.2 [22:14:29] *0.8.2 [22:14:37] haha [22:14:48] Is that like MediaWiki's 1.x versioning? :) [22:15:45] Krenair: thanks! [22:16:01] umm. I didn't do much, but you're welcome [22:16:21] greg-g: yeah! but I do feel we're a lot more last-minute-y than is ideal and I think that's not too fair for ops folks :( [22:16:22] Krenair: pointing me to the newer version was super-helpful. [22:17:35] AndyRussG: live and learn [22:19:55] heh [22:23:56] PROBLEM - puppet last run on ms-be2010 is CRITICAL: CRITICAL: Puppet has 2 failures [22:24:32] (03PS1) 10Thcipriani: De-decorate inside_git_dir [tools/scap] - 10https://gerrit.wikimedia.org/r/240912 [22:28:28] 6operations, 10ops-eqiad, 10netops: cr2-eqiad PEM 2 failure - https://phabricator.wikimedia.org/T112000#1672972 (10Cmjohnson) 5Open>3Resolved RMA was received...resolving Dear Sir/Madam, For RMA R395687-1 for Case Number 2015-0910-0589, Juniper has received part with serial# QCS1042C022 at our return... [22:32:51] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 for sbisson - https://phabricator.wikimedia.org/T113676#1672993 (10Catrope) 3NEW [22:36:42] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 for sbisson - https://phabricator.wikimedia.org/T113676#1673020 (10Catrope) Supervisor approval: I'm Stephane's direct supervisor and I filed the task :) Who provides "project lead approval" for stat1003? @kevinator perhaps? [22:40:17] RECOVERY - puppet last run on mw2165 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:47:14] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 for etonkovidova - https://phabricator.wikimedia.org/T113680#1673053 (10Catrope) 3NEW [22:47:36] (03PS2) 10Tim Landscheidt: [WIP] swiftrepl: Fix pyflakes warnings [software] - 10https://gerrit.wikimedia.org/r/205086 [22:49:47] RECOVERY - puppet last run on ms-be2010 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [22:55:19] (03Abandoned) 10Tim Starling: Give ElasticSearch roots access to the logstash boxes also [puppet] - 10https://gerrit.wikimedia.org/r/240635 (https://phabricator.wikimedia.org/T113569) (owner: 10Tim Starling) [22:55:37] (03PS2) 10Tim Starling: Update personal .bashrc [puppet] - 10https://gerrit.wikimedia.org/r/238363 [23:00:04] RoanKattouw ostriches rmoen Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150924T2300). [23:00:04] Krenair irc-nickname: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
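The change merged just below (gerrit 240880, T113654) is the usual temporary exception added to wmf-config/throttle.php for the UC Berkeley range discussed earlier. Such an entry looks roughly like this — the time window, wiki list and limit are illustrative assumptions, not values copied from the actual patch:

    // Illustrative only: a time-boxed account-creation throttle exception.
    $wmgThrottlingExceptions[] = array(
        'from'   => '2015-09-26T17:00 +0:00',  // assumed start of the window (UTC)
        'to'     => '2015-09-26T22:00 +0:00',  // assumed end of the window (UTC)
        'range'  => '192.31.105.0/24',         // the Cal Visitor range identified above
        'dbname' => array( 'enwiki' ),         // assumed target wiki(s)
        'value'  => 40,                        // accounts allowed per IP during the window
    );

Outside the window (or outside the range) the normal $wgAccountCreationThrottle limit applies again.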
[23:01:31] (03CR) 10Alex Monk: [C: 032] Raise account creation throttle for editathon at UC Berkeley [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240880 (https://phabricator.wikimedia.org/T113654) (owner: 10Alex Monk) [23:01:41] (03Merged) 10jenkins-bot: Raise account creation throttle for editathon at UC Berkeley [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240880 (https://phabricator.wikimedia.org/T113654) (owner: 10Alex Monk) [23:02:28] !log krenair@tin Synchronized wmf-config/throttle.php: https://gerrit.wikimedia.org/r/#/c/240880/ (duration: 00m 17s) [23:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:08:37] (03PS1) 10Ottomata: Puppetize etcd use for eventlogging processor [puppet] - 10https://gerrit.wikimedia.org/r/240916 (https://phabricator.wikimedia.org/T112688) [23:09:46] (03CR) 10Ottomata: "Giuseppe, see TODO comments in role/eventlogging.pp" [puppet] - 10https://gerrit.wikimedia.org/r/240916 (https://phabricator.wikimedia.org/T112688) (owner: 10Ottomata) [23:15:12] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1002 for Chedasaurus - https://phabricator.wikimedia.org/T113302#1673134 (10egalvezwmf) Thank you! [23:15:29] (03CR) 10Tim Landscheidt: [C: 04-1] ""limit" still to be dealt with. Not sure how best to attack it." [software] - 10https://gerrit.wikimedia.org/r/205086 (owner: 10Tim Landscheidt) [23:21:39] (03PS1) 10Krinkle: beta: Remove commented out rules for www2.knams.wikimedia.org/stats [puppet] - 10https://gerrit.wikimedia.org/r/240919 [23:41:16] 6operations: NetEase/YouDao company seeks guidance for setting up local mirror of wikipedia - https://phabricator.wikimedia.org/T89137#1673233 (10JanWMF) 5Open>3Resolved