[00:07:12] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: (unnamed) not-conn: cp3034_v6 no-xfrm: cp3037_v6 [00:08:13] (03PS1) 10Tim Landscheidt: WIP: Add BigBrotherMonitor [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/233338 [00:08:36] (03CR) 10jenkins-bot: [V: 04-1] WIP: Add BigBrotherMonitor [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/233338 (owner: 10Tim Landscheidt) [00:09:03] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 44 ESP OK [00:10:32] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [00:18:23] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [00:32:32] PROBLEM - Disk space on mw1010 is CRITICAL: DISK CRITICAL - free space: / 8184 MB (3% inode=93%) [00:36:02] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0] [00:49:32] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [01:04:13] PROBLEM - RAID on es1006 is CRITICAL 1 failed LD(s) (Degraded) [01:20:23] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [01:22:14] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 497 bytes in 0.002 second response time [01:58:53] PROBLEM - IPsec on cp1059 is CRITICAL: Strongswan CRITICAL - ok: 15 not-conn: cp4011_v6 [02:01:03] RECOVERY - IPsec on cp1059 is OK: Strongswan OK - 16 ESP OK [02:21:50] !log l10nupdate@tin Synchronized php-1.26wmf19/cache/l10n: l10nupdate for 1.26wmf19 (duration: 06m 36s) [02:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:03:02] PROBLEM - LVS HTTP IPv6 on upload-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [03:04:53] RECOVERY - LVS HTTP IPv6 on upload-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS 
Redirect - 455 bytes in 0.004 second response time [04:12:24] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL 15.38% of data above the critical threshold [100000000.0] [04:18:43] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0] [04:28:22] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [04:43:14] RECOVERY - Incoming network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0] [05:26:32] PROBLEM - salt-minion processes on analytics1037 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:26:33] PROBLEM - Hadoop DataNode on analytics1037 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:26:33] PROBLEM - dhclient process on analytics1037 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:27:02] PROBLEM - Hadoop NodeManager on analytics1037 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:32:13] RECOVERY - salt-minion processes on analytics1037 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [05:32:23] RECOVERY - Hadoop DataNode on analytics1037 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [05:32:23] RECOVERY - dhclient process on analytics1037 is OK: PROCS OK: 0 processes with command name dhclient [05:32:52] RECOVERY - Hadoop NodeManager on analytics1037 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [06:29:53] PROBLEM - puppet last run on mw2023 is CRITICAL Puppet has 1 failures [06:29:53] PROBLEM - puppet last run on mc2005 is CRITICAL Puppet has 1 failures [06:30:43] PROBLEM - puppet last run on pybal-test2002 is CRITICAL puppet fail [06:31:53] PROBLEM - puppet last run on mw2207 is CRITICAL Puppet has 1 failures [06:31:53] PROBLEM - puppet last run on mw2158 is CRITICAL Puppet has 1 failures [06:31:53] PROBLEM - puppet 
last run on mw2126 is CRITICAL Puppet has 1 failures [06:31:53] PROBLEM - puppet last run on mw2129 is CRITICAL Puppet has 1 failures [06:32:04] PROBLEM - puppet last run on holmium is CRITICAL Puppet has 1 failures [06:32:13] PROBLEM - puppet last run on mw1009 is CRITICAL Puppet has 2 failures [06:32:13] PROBLEM - puppet last run on mw1110 is CRITICAL Puppet has 1 failures [06:33:13] PROBLEM - puppet last run on mw1135 is CRITICAL Puppet has 1 failures [06:33:24] PROBLEM - puppet last run on mw2073 is CRITICAL Puppet has 1 failures [06:33:52] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures [06:43:07] !log reloading dbproxy1003 service [06:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:44:03] RECOVERY - haproxy failover on dbproxy1003 is OK check_failover servers up 2 down 0 [06:55:12] RECOVERY - puppet last run on holmium is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:55:22] RECOVERY - puppet last run on mw1009 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [06:56:24] RECOVERY - puppet last run on mw1135 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures [06:57:03] RECOVERY - puppet last run on mw2207 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:57:03] RECOVERY - puppet last run on mw2158 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:57:03] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:57:03] RECOVERY - puppet last run on mw2023 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:03] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:57:04] RECOVERY - puppet last run on mc2005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:23] RECOVERY - puppet last run on 
mw1110 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:58:42] RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:02] RECOVERY - puppet last run on mw2129 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:53] RECOVERY - puppet last run on pybal-test2002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:03:21] 6operations, 10ops-eqiad: es1005 and es1006 have degraded RAIDs (failed disks each) - https://phabricator.wikimedia.org/T110008#1566327 (10jcrespo) 3NEW [07:04:03] ACKNOWLEDGEMENT - RAID on es1006 is CRITICAL 1 failed LD(s) (Degraded) Jcrespo T110008 [07:08:53] PROBLEM - puppet last run on dbstore2002 is CRITICAL puppet fail [07:10:52] RECOVERY - puppet last run on dbstore2002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:12:02] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1414 bytes in 0.149 second response time [07:44:22] good morning [07:49:12] 6operations, 10MediaWiki-API, 7HHVM, 7Pywikibot-tests, 7Wikimedia-log-errors: internal_api_error_BadMethodCallException: [xxx] Exception Caught: Call to a member function getNames() on a non-object (NULL) - https://phabricator.wikimedia.org/T109929#1566378 (10Joe) [07:49:24] <_joe_> ciao hashar! 
[07:49:29] <_joe_> welcome back [07:49:59] danke :) [07:50:53] 10Ops-Access-Requests, 6operations, 3Mobile-Content-Service: Replace dbrant with mholloway for MobileApps production access - https://phabricator.wikimedia.org/T109857#1566385 (10MoritzMuehlenhoff) a:3MoritzMuehlenhoff [07:51:14] 10Ops-Access-Requests, 6operations, 3Mobile-Content-Service: Add bsitzmann and mholloway as deployers for the MobileApps service - https://phabricator.wikimedia.org/T109855#1566390 (10MoritzMuehlenhoff) a:3MoritzMuehlenhoff [07:52:08] 6operations, 10MediaWiki-API, 7HHVM, 7Pywikibot-tests, 7Wikimedia-log-errors: internal_api_error_BadMethodCallException: [xxx] Exception Caught: Call to a member function getNames() on a non-object (NULL) - https://phabricator.wikimedia.org/T109929#1566400 (10jayvdb) @joe, `tests.api_tests.TestParamInfo.... [07:52:41] hashar_: good morning and welcome back :-) [08:04:30] 6operations, 10MediaWiki-API, 7HHVM, 7Pywikibot-tests, 7Wikimedia-log-errors: internal_api_error_BadMethodCallException: [xxx] Exception Caught: Call to a member function getNames() on a non-object (NULL) - https://phabricator.wikimedia.org/T109929#1566433 (10Joe) @jayvdb thanks, I am trying to extract t... [08:05:11] 6operations, 7Monitoring: add pdu redundancy checking to server/router/switch checks in icinga - https://phabricator.wikimedia.org/T109903#1566435 (10fgiunchedi) +1 that'd be really useful, are the PDUs in ulsfo monitorable too? I'm seeing only codfw and eqiad in librenms [08:08:32] 6operations, 10MediaWiki-API, 7HHVM, 7Pywikibot-tests, 7Wikimedia-log-errors: internal_api_error_BadMethodCallException: [xxx] Exception Caught: Call to a member function getNames() on a non-object (NULL) - https://phabricator.wikimedia.org/T109929#1566442 (10jayvdb) @joe, the exact requests differ per w... 
[08:12:54] (03CR) 10Filippo Giunchedi: Change memcached icinga alert from anomaly to threshold (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/233071 (https://phabricator.wikimedia.org/T69817) (owner: 10BryanDavis) [08:28:05] 6operations, 10MediaWiki-API, 7HHVM, 7Pywikibot-tests, 7Wikimedia-log-errors: internal_api_error_BadMethodCallException: [xxx] Exception Caught: Call to a member function getNames() on a non-object (NULL) - https://phabricator.wikimedia.org/T109929#1566495 (10Joe) I just restarted HHVM on mw1144, and the... [08:28:41] 6operations, 10MediaWiki-API, 7HHVM, 7Pywikibot-tests, 7Wikimedia-log-errors: internal_api_error_BadMethodCallException: [xxx] Exception Caught: Call to a member function getNames() on a non-object (NULL) - https://phabricator.wikimedia.org/T109929#1566497 (10Joe) p:5Unbreak!>3Normal [08:31:04] !log cleaned up others lockdir for replication on labstore1002 and started it manually [08:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:32:31] (03PS3) 10Alexandros Kosiaris: cassandra: Mute strict puppet-lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/233073 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [08:32:36] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] cassandra: Mute strict puppet-lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/233073 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [08:33:33] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1566515 (10Joe) So what we really need, in my understanding, is to introduce canary hosts in all our clusters, and also allow to identify... 
[08:34:54] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied [08:35:20] I'm handling the labstore stuff [08:35:29] slightly bogus alert that, I thought that had been fixed [08:38:14] (03CR) 10Filippo Giunchedi: Add service deploy via scap (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/224374 (owner: 10Thcipriani) [08:42:07] 6operations, 10Continuous-Integration-Infrastructure, 6Multimedia, 5Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1566536 (10hashar) Maybe we can give it a shot on #beta-cluster ? Though I am not sure whether the... [08:43:24] _joe_: commons just got a time machine - https://commons.wikimedia.org/wiki/File:Lyfe_Jennings_-_Cry_-_Live_at_The_Howard_Theatre.webm <-- see encoding time in the transcode status section [08:47:05] (03CR) 10Alexandros Kosiaris: [C: 031] pybal: switch healthchecks to Special:BlankPage [puppet] - 10https://gerrit.wikimedia.org/r/233053 (owner: 10BBlack) [08:52:26] 6operations, 10Wikimedia-Mailing-lists: reinstall fermium with public IP - https://phabricator.wikimedia.org/T109890#1566541 (10akosiaris) Yes we can. Effectively it's the same thing (as in all data will be lost) but from the scope of the cluster management tool (Ganeti) only a VM parameter will have been chan... [08:57:23] 6operations, 10Wikimedia-Mailing-lists: write migration plan for mailman - https://phabricator.wikimedia.org/T109467#1566543 (10JohnLewis) Since I cant edit comments, 14 is done and depends on 15. Since nothing else depends on fermium existence as is and other work is blocked, I'm going to give Alex an okay go... 
[08:58:04] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100%
[08:59:02] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 51.85 ms
[08:59:29] operations, Wikimedia-Mailing-lists: reinstall fermium with public IP - https://phabricator.wikimedia.org/T109890#1566549 (JohnLewis) @akosiaris nothing is dependent on fermium, everything we need has been puppetised and only a single file exists in my home directory which will be lost (which can be). So...
[09:00:13] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[09:00:13] !log others replication on labstore1002 completed successfully
[09:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:00:33] !log cleaning up lockdir on labstore for maps and tools
[09:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:00:46] !log starting up replicate for tools on labstore1002
[09:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:05:30] Ops-Access-Requests, operations, Mobile-Content-Service: Add bsitzmann and mholloway as deployers for the MobileApps service - https://phabricator.wikimedia.org/T109855#1566577 (akosiaris) a:MoritzMuehlenhoff>akosiaris Stealing as I am on clinic duty this week. Obviously this is fine, @bearND, @M...
[09:06:03] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-tools/snapshot is not accessible: Permission denied
[09:07:06] Ops-Access-Requests, operations, Mobile-Content-Service: Replace dbrant with mholloway for MobileApps production access - https://phabricator.wikimedia.org/T109857#1566586 (akosiaris) a:MoritzMuehlenhoff>akosiaris Stealing as I am on clinic duty. Seems fine. I think manager approval in T109855 shoul...
[09:07:21] akosiaris: thanks for saying you can handle fermium when you have time.
Be lovely to see the first public VM too :) [09:08:31] JohnFLewis: I am clinic duty this week, it's gonna be faster than "when I have time" ;-). I am also excited about this (I am sure there will be something I have not predicted) [09:09:14] (03PS1) 10Hashar: Pin mock<1.1.0 and add tox entry point [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/233360 [09:09:16] (03PS1) 10Hashar: pass flake8 and add entry point [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/233361 [09:09:26] 6operations, 6Labs: labstore1002 out of space in vg to create new snapshots - https://phabricator.wikimedia.org/T109954#1566600 (10yuvipanda) All better now, once the lockdirs were deleted. Not sure what the original cause of failure or the cause of the cascade was. [09:12:21] 6operations, 6Labs: labstore1002 out of space in vg to create new snapshots - https://phabricator.wikimedia.org/T109954#1566601 (10yuvipanda) we need to tighten up monitoring, and also provide actual documentation for how to recover from one. I've written up some notes at https://wikitech.wikimedia.org/wiki/NF... 
[09:13:09] akosiaris: fermium is also perfect because the whole vm resolves around doing something unpredictable, migrating mailman ;) [09:14:36] hehe [09:19:29] (03CR) 10Hashar: "check experimental" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/233360 (owner: 10Hashar) [09:19:47] (03CR) 10Hashar: "check experimental" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/233361 (owner: 10Hashar) [09:22:07] (03CR) 10Hashar: "The poor install does not run on Jenkins :-/" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/233360 (owner: 10Hashar) [09:23:07] (03PS1) 10Muehlenhoff: Replace dbrant with mholloway for mobileapps prod access [puppet] - 10https://gerrit.wikimedia.org/r/233364 (https://phabricator.wikimedia.org/T109857) [09:23:25] <_joe_> oh hashar, thanks [09:23:30] <_joe_> lemme see what fails there [09:23:46] there is some madness with mock>=1.1.0 requiring setup tools 17.1 [09:23:59] https://integration.wikimedia.org/ci/job/tox-py27-jessie/31/console [09:24:00] \O/ [09:24:03] (03CR) 10jenkins-bot: [V: 04-1] Replace dbrant with mholloway for mobileapps prod access [puppet] - 10https://gerrit.wikimedia.org/r/233364 (https://phabricator.wikimedia.org/T109857) (owner: 10Muehlenhoff) [09:24:29] <_joe_> hashar: thanks! [09:27:10] (03CR) 10Hashar: "check experimental" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/233360 (owner: 10Hashar) [09:27:37] (03CR) 10Hashar: "check experimental" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/233361 (owner: 10Hashar) [09:28:19] both fine. 
Whenever the changes land, I will make the jobs voting :-} [09:29:02] (03PS2) 10Muehlenhoff: Replace dbrant with mholloway for mobileapps prod access [puppet] - 10https://gerrit.wikimedia.org/r/233364 (https://phabricator.wikimedia.org/T109857) [09:34:42] (03PS7) 10Giuseppe Lavagetto: service: add deployment_script define [puppet] - 10https://gerrit.wikimedia.org/r/231790 [09:36:35] (03CR) 10Giuseppe Lavagetto: [C: 032] service: add deployment_script define [puppet] - 10https://gerrit.wikimedia.org/r/231790 (owner: 10Giuseppe Lavagetto) [09:51:05] akosiaris: i am unable to git pull anymore in the puppet repo. https://tools.wmflabs.org/paste/view/0cf60c65 [09:51:38] the latest upgrade i suspect that is related is upgrading to ssh 7.1 [09:51:46] (on my pc) [09:52:31] matanya: try that by sshing to Gerrit [09:53:03] (with -vvv if it fails too) [09:53:28] JohnFLewis: i did that, that is why i was suspecting this [09:54:29] Well that may be more informational than the git error. Post the output from that perhaps? [09:54:48] matanya: https://code.google.com/p/gerrit/issues/detail?id=3517 [09:55:24] thanks moritzm , pain in the ... [09:55:37] unfortunately gerrit uses jsch, which isn't up-to-date with current SSH crypto [09:55:50] hashar opened some tickets, but I don't remember the status [09:56:20] hashar: please add me as CC, if you have those under hand [09:56:36] moritzm: did you meet Lior kaplan at debconf ? [09:56:50] Jenkins has a similar issue as well [09:57:07] but it uses another SSH implementation which is definitely missing the recent ssh algorithms :/ [09:57:22] does phab have the latest stuff? 
[09:57:35] maybe we can allow pushing to the repos hosted there
[09:57:36] the task for Jenkins is https://phabricator.wikimedia.org/T103351
[09:58:35] matanya: though looking at the bug linked by moritzm, OpenSSH says you can re-add them so it works on v7
[09:58:35] matanya: yes, he thinks PHP7 outperforms HHVM, I gave him Ori's email address to follow up on that :-)
[09:59:01] moritzm: thanks, i did that in the past, that was interesting. :)
[09:59:32] JohnFLewis: yes, i did that, but now i need to remember moving that in the future :)
[09:59:52] matanya: so it works? Awesome :)
[10:01:15] there is a PHP conference in France end of November geared toward PHP 7
[10:02:01] they would love to have someone to talk about hhvm. Heck maybe I should poke our ops list about
[10:02:02] it
[10:02:03] <_joe_> telling everyone how much better than HHVM it is?
[10:02:34] trolling is great :D
[10:03:02] <_joe_> I mean people at the conference, I am pretty agnostic about it until I can see a stable release and test it
[10:03:03] just saying :D
[10:03:24] hashar: we could tell ori it is a performance conference, book him to talk about HHVM then when he turns up - PHP conference ;)
[10:03:30] <_joe_> benchmarks are usually pointless, esp. if they don't involve real usage patterns
[10:03:34] last year people asked me whether we evaluated more recent Zend versions than 5.3 we have been using
[10:04:04] with limited knowledge, I told them hhvm offered additional features that PHP 5.x did not offer (such as the byte code cache / JIT)
[10:04:10] apparently PHP 7 is going this way as well
[10:04:33] PHP7: HHVM just not built by Facebook!
[10:05:07] perl6 is where it's at
[10:07:22] dead end ?
[10:07:22] JohnFLewis: what needs to be done to sign L2 ?
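For anyone hitting the same Gerrit failure after upgrading to OpenSSH 7.x: the "re-add them" workaround mentioned at 09:58:35 is usually applied per host in ~/.ssh/config. The fragment below is only a sketch. The log never names the missing algorithm, so it assumes the culprit is the diffie-hellman-group1-sha1 key exchange (which OpenSSH 7.0 disabled by default) and that Gerrit is reached at gerrit.wikimedia.org on Gerrit's default SSH port 29418. Check what `ssh -vvv` reports before copying it.

```
# ~/.ssh/config
# Assumption: the server only offers diffie-hellman-group1-sha1.
# The leading "+" appends to the client's default list instead of replacing it.
Host gerrit.wikimedia.org
    Port 29418
    KexAlgorithms +diffie-hellman-group1-sha1
```

Scoping the override to a single Host block keeps the weaker key exchange disabled for every other server, and leaves one obvious stanza to delete once Gerrit's SSH library catches up.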
[10:07:34] 7Puppet, 10Continuous-Integration-Config, 5Patch-For-Review: Setup rubocop for operations/puppet ruby code lints - https://phabricator.wikimedia.org/T102020#1566782 (10zeljkofilipin) @faidon I have left a comment in [[ https://gerrit.wikimedia.org/r/#/c/226898/ | gerrit ]], but I am not sure if you will noti... [10:08:07] (03CR) 10Zfilipin: "Left a more elaborate comment in phab https://phabricator.wikimedia.org/T102020#1566782" [puppet] - 10https://gerrit.wikimedia.org/r/226898 (owner: 10Faidon Liambotis) [10:08:27] matanya: https://phabricator.wikimedia.org/maniphest/task/create/?projects=WMF-NDA-Requests say you want to sign an NDA :) [10:09:43] matanya: https://phabricator.wikimedia.org/T108057 be informational too if you want like you already have an NDA [10:09:44] thanks JohnFLewis [10:11:49] Hi [10:11:53] Not seeing Images on en [10:12:02] And I haven't disabled them in the browser [10:12:04] Thoughts? [10:12:10] *enwikipedia [10:12:24] ShakespeareFan00: hello, it works for me [10:12:31] It doesn't for me [10:12:42] ShakespeareFan00: try force reloading? [10:12:43] are you on your private network [10:12:51] Nope [10:13:07] I've tried refreshing the page concerned several times [10:13:08] if that doesn't work, check the network tab (F12) to see if there is more information there [10:13:18] It's only one specific page? [10:13:22] <_joe_> ShakespeareFan00: can you point me to a specifc page, and a specific image you don't see? 
[10:13:41] https://en.wikipedia.org/wiki/Special:ListFiles
[10:13:49] Wasn't showing images
[10:14:00] It's been doing this intermittently for a few days
[10:14:30] <_joe_> ok that sounds like an imagescalers lag issue
[10:15:16] (PS1) Muehlenhoff: Add Stas to statistics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/233375 (https://phabricator.wikimedia.org/T109357)
[10:15:29] _joe_: I've also been having issues with GIF images
[10:15:42] (Ideally ALL the non animated ones should be PNG)
[10:16:31] Example : - https://en.wikipedia.org/wiki/File:Voip-typical.gif
[10:16:37] No image displayed
[10:16:51] <_joe_> uhm I see that just fine
[10:17:46] hashar: are you planning to send the ssh-issue to wikitech-l ?
[10:17:49] ShakespeareFan00: http://upload.wikimedia.org/wikipedia/commons/4/44/Voip-typical.gif -- does that open for you?
[10:18:11] valhallasw`cloud: Nope - Blank
[10:18:41] Not even header data - Just a blank page
[10:18:57] ShakespeareFan00: Press F12, open the network tab, then try again. What does that show?
[10:18:59] <_joe_> ShakespeareFan00: https://upload.wikimedia.org/wikipedia/commons/4/44/Voip-typical.gif ?
[10:19:32] Nothing
[10:19:43] and nothing in the F12 dialog (console) either :(
[10:19:55] ShakespeareFan00: network tab.
[10:20:05] Yep, looked there as well
[10:20:19] first go to the network tab, then reload. The order is important
[10:21:53] Does the GET but I don't see anything in response
[10:22:58] <_joe_> ShakespeareFan00: what browser are you using?
[10:23:15] <_joe_> oh well
[10:23:16] <_joe_> :)
[10:24:02] OK You can yell like crazy at Avast
[10:24:03] <_joe_> valhallasw`cloud: do you happen to remember which is the tool on toollabs that shows a thumb wall of the last uploaded images?
[10:24:13] nope
[10:24:18] Disabling the Avast plugin solved the GIF issue temporarily
[10:24:21] <_joe_> valhallasw`cloud: thanks anyways
[10:24:37] But having to do that is symptomatic of something else being wrong
[10:24:39] <_joe_> ShakespeareFan00: ah! interesting. I should scan that gif maybe
[10:24:47] <_joe_> ShakespeareFan00: not really
[10:24:56] _joe_: maybe listed in https://tools.wmflabs.org/hay/directory/#/search/image ?
[10:25:03] Also i would suggest checking all the security certificates are up to date
[10:25:19] (PS1) Muehlenhoff: Add Erik Bernhardson to statistics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/233376 (https://phabricator.wikimedia.org/T109356)
[10:25:20] Avast in a recent update won't let you do https on an invalid certificate
[10:25:23] Puppet, Continuous-Integration-Config: Move RuboCop job from experimental pipeline to the usual pipelines for operations/puppet - https://phabricator.wikimedia.org/T110019#1566829 (zeljkofilipin) NEW
[10:26:18] <_joe_> ShakespeareFan00: nope, I am pretty sure our certs are valid and up to date
[10:26:40] <_joe_> globalsign had an issue with OCSP last week, so maybe your avast cached that
[10:26:56] <_joe_> or someone is MITM'ing you to commons
[10:27:24] <_joe_> but that would mean you cannot see any image on wikis
[10:27:33] <_joe_> well, any image but the local ones
[10:28:23] operations, discovery-system: Make puppet ca certificate world readable - https://phabricator.wikimedia.org/T110020#1566842 (yuvipanda) NEW
[10:28:30] _joe_: ^ for etcd / https for later.
[10:31:06] And as I understand it the MITM is the Avast plugin
[10:31:15] (I've had issues with it before)
[10:32:16] Puppet, Beta-Cluster: Puppet failures on deployment-mx: can't find puppet://private/dkim/wikimedia.org-wiki-mail.key - https://phabricator.wikimedia.org/T87848#1566859 (hashar)
[10:32:40] Puppet, Beta-Cluster: Puppet failures on deployment-mx: can't find puppet://private/dkim/wikimedia.org-wiki-mail.key - https://phabricator.wikimedia.org/T87848#1000610 (hashar) Still occurring. I have refreshed the puppet error output since we are now using `secret()`.
[10:55:07] andere__; the persistent lowering of prio for commons related bus has become unbearable. Are you aware of this? It was also noticed by multiple users. I won't be the bad boy who is yelling around, but if you lower a priority please be so kind and add an explanation why. Especially because most of the commons users are not very familiar. Users are disappointed about the rude tone and prio playing there at phab. Just telling,
[10:55:16] *bugs
[11:00:44] PROBLEM - puppet last run on ms-be2011 is CRITICAL puppet fail
[11:03:03] Steinsplitter, this is not #operations related and if you want my attention I recommend to spell my name correctly. :)
[11:03:13] Steinsplitter, if there is "rude tone", please provide specific examples.
[11:04:02] Steinsplitter, apart from that, I don't think that you can always set a priority without even knowing how many people are affected etc. If you can provide such information initially it makes it way easier to set a proper priority.
[11:04:04] andre__: you don't answer my question about the prioritization of bugs.
[11:04:14] what is your question. That I'm aware of something?
[11:04:40] you think that experienced users don't know how many users are affected?
[11:05:04] Steinsplitter, they might, they might not. Depends on the problem. If they know they should clearly state it in bug reports.
[11:05:13] Don't make other people guess by holding back information.
[11:05:23] Plus that makes it way easier to interpret which priority an issue should have.
[11:06:37] if something is broken someone should look into it. users don't know how phab works.
[11:06:54] Steinsplitter, How is that sentence related to what I wrote before?
[11:07:03] Yes, please look into it when something is broken.
[11:07:18] Yes, many people don't know how something (whether that's Phab or Wikipedia) works. How is that related?
[11:07:29] then you come and change the status to needs triage or low. for commons related.
[11:07:30] As a reminder, "rude tone" examples are very welcome.
[11:07:53] operations, discovery-system: Make puppet ca certificate world readable - https://phabricator.wikimedia.org/T110020#1566915 (yuvipanda) I am also not sure if I am using the right terminology here. /nick TLSn00b
[11:07:54] Steinsplitter, yes, needs triage for stuff that you set to "high" without explaining how many people are affected or if there is more than one testcase.
[11:08:21] Steinsplitter, so that's part of realistic priority setting. Provide info that more users are affected and more than one testcase and I will leave it as "high priority".
[11:08:30] and you set bugs to low without elaborating why. people are confused and leaving commons.
[11:08:40] If it's unclear how many folks / files are affected, I will reset to "Needs Triage" so the team will look at it.
[11:09:06] something like "Not enough information" prio --> low ---> triage sounds rude for non phabricatorians
[11:09:30] Well, I do add comments explaining what more information I'd like to see.
[11:09:40] (Which does not mean that I could explain my actions better if I had more time)
[11:09:44] err couldn't
[11:10:01] then a tech should look into it. you can't ask a 70 year old man to provide more data (just an example)
[11:10:08] Yeah, "should".
[11:10:10] Yes I can.
[11:10:19] But I need to explain how.
[11:10:22] of course I can.
[11:10:30] there's an entire community who can help to get that data.
[11:10:46] and until that data is provided, some stuff might not be high priority at all. But instead "needs triage".
[11:11:11] in general, https://www.mediawiki.org/wiki/Phabricator/Project_management#Setting_task_priorities
[11:11:25] i'm not criticizing or advocating. Just giving you feedback. It is of course up to you how you do the work.
[11:11:26] Steinsplitter, ...and still, "rude tone" examples are very welcome.
[11:11:36] see my comment above.
[11:11:42] well... Tone requires language.
[11:11:46] 13:09:29
[11:12:02] so you don't mean some "rude tone" but instead "actions that some people might not understand"
[11:12:20] Thanks for the feedback, definitely.
[11:12:59] Steinsplitter, but when I see too many things set to high priority, especially when it's unclear (due to missing info provided) if they actually should be high priority, I will reset priority for the time being until there's enough info
[11:13:10] ok
[11:13:15] because otherwise everything will become high priority :-/
[11:14:01] Steinsplitter, hope that explains it a little bit. and yes, in general bug reporting and not always getting quick answers or investigation is frustrating.
I know that myself from other projects where I'm "just a clueless user" and try to report stuff [11:17:19] (03PS4) 10Muehlenhoff: ferm rules for elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/224095 (https://phabricator.wikimedia.org/T104962) [11:18:24] (03CR) 10Muehlenhoff: [C: 032 V: 032] ferm rules for elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/224095 (https://phabricator.wikimedia.org/T104962) (owner: 10Muehlenhoff) [11:29:54] RECOVERY - puppet last run on ms-be2011 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [11:31:34] (03PS1) 10Giuseppe Lavagetto: etcd: remove etcd1001 from the list clients consume [dns] - 10https://gerrit.wikimedia.org/r/233384 [11:31:36] (03PS1) 10Giuseppe Lavagetto: etcd: remove etcd1001 from the list servers consume [dns] - 10https://gerrit.wikimedia.org/r/233385 [11:31:49] <_joe_> akosiaris, YuviPanda ^^ [11:32:07] so, first clients, make sure nothing is connected, then remove, then servers [11:32:30] <_joe_> yes [11:32:48] <_joe_> well things will stay connected if they have long-lasting connections like watches [11:33:23] not sure I follow the 4 changes though [11:33:28] do they handle disconnects well? [11:33:50] <_joe_> akosiaris: why? [11:33:58] <_joe_> I mean what you don't follow? [11:34:17] ah, one per DC ? [11:34:19] <_joe_> every dc has a local list of nodes, which for now are all the same [11:34:20] <_joe_> :) [11:34:23] ok [11:34:35] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] etcd: remove etcd1001 from the list clients consume [dns] - 10https://gerrit.wikimedia.org/r/233384 (owner: 10Giuseppe Lavagetto) [11:34:41] ok merging [11:34:48] <_joe_> yes, go on [11:35:01] <_joe_> ofc it will take at least 5 minutes for this to be effective [11:35:12] ok done. 
waiting 5 minutes now [11:36:24] <_joe_> we should probably make a backup right now [11:36:58] <_joe_> I'm doing it the quick and dirty way - using etcdumper - and the correct way - using etcdctl backup [11:42:10] moar documentation! [11:42:14] also lol etcd runs on Windows [11:42:34] <_joe_> etcdumper is not properly installed on our cluster btw [11:42:45] <_joe_> because it lacks Debian packaging [11:44:37] YuviPanda: lintian is not liking some stuff on mwparserfromhell [11:44:42] http://mentors.debian.net/package/mwparserfromhell [11:44:49] akosiaris: yeah, am fixing those now [11:44:55] akosiaris: those are all new ones from turning on the C module [11:44:56] ok [11:45:00] yup [11:45:08] so... many clients on etcd1001 [11:45:46] just stop and have them reconnect somewhere else ? [11:45:58] _joe_: ^ ? [11:46:07] <_joe_> akosiaris: yeah 1 sec I'm verifying one thing [11:47:27] <_joe_> akosiaris: yeah basically confd will reconnect once we turn down the server [11:47:53] <_joe_> so we have two options: restart etcd on etcd1001 and verify no one tries to reconnect [11:48:07] <_joe_> or we just remove it from the cluster and stop it afterwards [11:48:16] <_joe_> btw, I should stop puppet there for now [11:48:19] (03PS5) 10Muehlenhoff: ferm rules for nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970) (owner: 10Dzahn) [11:48:22] (03CR) 10Alexandros Kosiaris: [C: 031] "Looks ok, waiting manager approval on related task" [puppet] - 10https://gerrit.wikimedia.org/r/233364 (https://phabricator.wikimedia.org/T109857) (owner: 10Muehlenhoff) [11:49:31] <_joe_> akosiaris: what's your opinion? 
[11:49:39] <_joe_> I think we should restart it just to be sure [11:49:55] well, curtain #1 is the safe route, curtain #2 is the more adventurous [11:50:00] let's go for 1 on this one [11:50:07] <_joe_> ok [11:50:12] and we get adventurous on the other one [11:50:19] <_joe_> !log restarting etcd on etcd1001 [11:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:50:45] hehe [11:51:04] still seeing connections [11:51:19] I think confd is not checking DNS ? [11:51:30] as in it has it cached ? [11:51:30] <_joe_> yeah maybe [11:51:32] <_joe_> wtf [11:51:35] <_joe_> probably yes [11:51:54] ah, DNS client side caching [11:51:58] the gift that keepson giving? [11:52:06] <_joe_> yeah they're from the same few servers too [11:52:07] <_joe_> meh [11:52:33] akosiaris: fixed all the errors, btw :D [11:52:53] <_joe_> python-etcd would do the same btw - it will just check the cluster to have info on what the real servers are [11:53:00] YuviPanda: you are fast [11:53:15] _joe_: wanna try curtain #2 ? [11:53:28] <_joe_> akosiaris: yep, just fixing up one detail [11:54:09] (03PS1) 10Giuseppe Lavagetto: conftool: remove etcd1001 from the list of servers [puppet] - 10https://gerrit.wikimedia.org/r/233386 [11:54:31] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: remove etcd1001 from the list of servers [puppet] - 10https://gerrit.wikimedia.org/r/233386 (owner: 10Giuseppe Lavagetto) [11:54:38] (03CR) 10Giuseppe Lavagetto: [V: 032] conftool: remove etcd1001 from the list of servers [puppet] - 10https://gerrit.wikimedia.org/r/233386 (owner: 10Giuseppe Lavagetto) [11:54:48] hmm [11:55:32] <_joe_> yeah I know - the version of python-etcd we're using on precise still doesn't support srv-record discovery. I'll update that :) [11:55:36] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me. I've added the handling for Redis to the existing patch. 
Daniel or Ori, could you please re-review the updated PS5 (sinc" [puppet] - 10https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970) (owner: 10Dzahn) [11:56:01] <_joe_> ok, so. someone else wants to remove the server from the cluster? :P [11:56:22] I 'll do it [11:56:49] <_joe_> so use conf1001 as your reference server [11:57:11] sudo etcdctl --ca-file /var/lib/etcd/ssl/certs/ca.pem -C https://conf1001.eqiad.wmnet:2379 member remove 460d53f044bf905e [11:57:11] Removed member 460d53f044bf905e from cluster [11:57:24] it instructed me to put the id there instead of the hostname [11:57:39] 3 followers now [11:57:41] <_joe_> oh, ehe [11:57:43] conf1003 is the leader [11:57:58] oh [11:58:00] connections gone [11:58:00] <_joe_> !log stopping etcd on etcd1001 [11:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:58:08] sigh [11:58:09] <_joe_> akosiaris: that was me actually [11:58:10] Steinsplitter: note that the priority on phabricator reflects the priority /for the team/ not the priority /for the reporting user/. It's a tool for team planning, not a tool to show how important the bug is to commons. [11:58:11] that was you [11:58:15] and I got happy [11:58:24] <_joe_> no client connections would [11:58:35] <_joe_> 've died if we were using python-etcd everywhere [11:58:42] Steinsplitter: to communicate the latter, please mention what kind of workflow the bug prevents, as this will be one of the factors the teams use to determine priority [11:58:48] <_joe_> as it checks the cluster for its members on any failure [11:59:02] ok, so we can kill that box now [11:59:04] <_joe_> akosiaris: did you check the cluster health? [11:59:08] <_joe_> yes! 
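Note on the `member remove` step above: etcdctl wanted the member ID rather than the hostname. A minimal sketch of looking that ID up from `etcdctl member list` output follows. It assumes the etcd v2-style line format `<id>: name=<name> peerURLs=... clientURLs=...`; the sample text, the peer port 2380 and the second member's ID are illustrative placeholders, not captured from this cluster, and `member_id_by_name` is a hypothetical helper name.

```python
def member_id_by_name(member_list_output, name):
    """Return the ID of the member whose name= field equals `name`, or None.

    Assumes each line looks like '<id>: key=value key=value ...'
    (etcd v2-style `etcdctl member list` output; this format is an assumption).
    """
    for line in member_list_output.splitlines():
        member_id, sep, rest = line.partition(":")
        if not sep:
            continue  # not a member line
        # Turn 'name=foo peerURLs=bar' into {'name': 'foo', 'peerURLs': 'bar'}
        fields = dict(f.split("=", 1) for f in rest.split() if "=" in f)
        if fields.get("name") == name:
            return member_id.strip()
    return None


# Illustrative sample only: the first ID is the one removed in the log,
# everything else on these lines is made up for the example.
SAMPLE = (
    "460d53f044bf905e: name=etcd1001 "
    "peerURLs=https://etcd1001.eqiad.wmnet:2380 "
    "clientURLs=https://etcd1001.eqiad.wmnet:2379\n"
    "aaaaaaaaaaaaaaaa: name=conf1001 "
    "peerURLs=https://conf1001.eqiad.wmnet:2380 "
    "clientURLs=https://conf1001.eqiad.wmnet:2379\n"
)

if __name__ == "__main__":
    print(member_id_by_name(SAMPLE, "etcd1001"))
```

The returned ID is what would then be passed to `etcdctl ... member remove`, as in the command pasted at 11:57:11.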
[11:59:08] but let's not for now [11:59:19] at least until we 've repeated this once more [11:59:24] <_joe_> I'll just remove the role from puppet [11:59:36] <_joe_> so that we don't get alerts [11:59:53] <_joe_> anyways the other servers wouldn't accept talking to it, so... [12:00:22] all 4 members are healthy [12:01:12] <_joe_> and confd on one of the machines that were connecting to etcd1001 is now happy [12:01:29] <_joe_> I'd call this a success :) [12:02:22] PROBLEM - service on etcd1001 is CRITICAL - Expecting active but unit is inactive [12:02:41] <_joe_> yeah well this is expected, right? [12:03:49] yup [12:03:52] ok [12:03:52] (03PS1) 10Giuseppe Lavagetto: etcd: remove role from etcd1001 [puppet] - 10https://gerrit.wikimedia.org/r/233388 [12:04:07] <_joe_> akosiaris: ^^ this will fix it [12:04:14] YuviPanda: left comments for ya on http://mentors.debian.net/package/mwparserfromhell [12:04:37] akosiaris: will fix! [12:04:40] (03CR) 10jenkins-bot: [V: 04-1] etcd: remove role from etcd1001 [puppet] - 10https://gerrit.wikimedia.org/r/233388 (owner: 10Giuseppe Lavagetto) [12:04:50] <_joe_> I know stupid jenkins-bot [12:04:57] I mostly put the description very short because meh on writing [12:05:02] akosiaris: but if that's all then wheeee :D [12:05:11] (03PS2) 10Giuseppe Lavagetto: etcd: remove role from etcd1001 [puppet] - 10https://gerrit.wikimedia.org/r/233388 [12:05:48] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd: remove role from etcd1001 [puppet] - 10https://gerrit.wikimedia.org/r/233388 (owner: 10Giuseppe Lavagetto) [12:13:59] (03CR) 10Alexandros Kosiaris: [C: 032] etcd: remove etcd1001 from the list servers consume [dns] - 10https://gerrit.wikimedia.org/r/233385 (owner: 10Giuseppe Lavagetto) [12:16:48] (03PS1) 10Alexandros Kosiaris: Remove etcd clients records for etcd1002 [dns] - 10https://gerrit.wikimedia.org/r/233391 [12:16:50] (03PS1) 10Alexandros Kosiaris: Remove SRV server records for etcd1002 [dns] - 10https://gerrit.wikimedia.org/r/233392 
[12:17:23] (03CR) 10Alexandros Kosiaris: [C: 032] Remove etcd clients records for etcd1002 [dns] - 10https://gerrit.wikimedia.org/r/233391 (owner: 10Alexandros Kosiaris) [12:28:54] _joe_: https://phabricator.wikimedia.org/T91468#1563545 in case you didn't see. [12:30:08] It's possible https://phabricator.wikimedia.org/T89918 is also fixed, but it's not totally clear to me if HHVM 3.6 is fully deployed and whether it contains the relevant patch. [12:32:11] Katie: AFAICS we are at 3.6.5+dfsg1-1+wm3 [12:32:40] Cool. [12:42:28] 7Puppet, 10Continuous-Integration-Config: Move RuboCop job from experimental pipeline to the usual pipelines for operations/puppet - https://phabricator.wikimedia.org/T110019#1567100 (10hashar) [12:42:31] 7Puppet, 10Continuous-Integration-Config, 5Patch-For-Review: Setup rubocop for operations/puppet ruby code lints - https://phabricator.wikimedia.org/T102020#1567099 (10hashar) [12:43:13] 7Puppet, 10Continuous-Integration-Config: Move RuboCop job from experimental pipeline to the usual pipelines for operations/puppet - https://phabricator.wikimedia.org/T110019#1566829 (10hashar) I guess you want to write some tutorial / instructions for ops so they can run rubocop locally then announce the new... [12:44:00] 7Puppet, 10Continuous-Integration-Config, 5Patch-For-Review: Setup rubocop for operations/puppet ruby code lints - https://phabricator.wikimedia.org/T102020#1353796 (10hashar) @zeljkofilipin yup any third parties code in operations/puppet should be ignored by rubocop. [12:45:45] 6operations, 7discovery-system: Make puppet ca certificate world readable - https://phabricator.wikimedia.org/T110020#1567110 (10yuvipanda) @Krenair just pointed out that this could possibly give anyone with shell access access to private puppet info. 
[12:53:01] 6operations, 7discovery-system: Make puppet ca certificate world readable - https://phabricator.wikimedia.org/T110020#1567138 (10yuvipanda) However, there's no private info in labs puppet, so we can safely turn it on for labs. [12:54:18] !log stop etcd on etcd1002.eqiad.wmnet. Already removed from the cluster [12:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:55:11] (03CR) 10Alexandros Kosiaris: [C: 032] Remove SRV server records for etcd1002 [dns] - 10https://gerrit.wikimedia.org/r/233392 (owner: 10Alexandros Kosiaris) [12:55:22] <_joe_> Katie: uhm, thanks, that bug should've been resolved a long time ago [12:56:52] PROBLEM - service on etcd1002 is CRITICAL - Expecting active but unit is inactive [12:57:02] known ^ [12:57:33] 7Blocked-on-Operations, 6operations, 6Commons, 6Multimedia, and 6 others: Convert eqiad imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1567157 (10Joe) [12:57:42] PROBLEM - Disk space on mw1142 is CRITICAL: DISK CRITICAL - free space: / 8167 MB (3% inode=93%) [12:59:12] (03PS1) 10Alexandros Kosiaris: etcd: remove role from etc1002 [puppet] - 10https://gerrit.wikimedia.org/r/233395 [13:00:42] (03CR) 10Giuseppe Lavagetto: [C: 031] etcd: remove role from etc1002 [puppet] - 10https://gerrit.wikimedia.org/r/233395 (owner: 10Alexandros Kosiaris) [13:05:00] _joe_: so, should I kill etcd100{1,2} from ganeti ? what's up btw with etcd1003 ? [13:05:34] (03CR) 10Alexandros Kosiaris: [C: 032] etcd: remove role from etc1002 [puppet] - 10https://gerrit.wikimedia.org/r/233395 (owner: 10Alexandros Kosiaris) [13:05:59] <_joe_> I think I killed it? 
[13:06:12] <_joe_> I surely removed it from the cluster [13:06:25] oh, it's ADMIN_down in ganeti [13:06:28] so, powered off [13:06:36] ok I 'll kill them all together [13:06:47] documentation time in ganeti ;-) [13:08:20] 7Blocked-on-Operations: Remove etcd100{1,2,3}.eqiad.wmnet from the fleet - https://phabricator.wikimedia.org/T110030#1567171 (10akosiaris) 3NEW a:3akosiaris [13:09:19] <_joe_> akosiaris: yeah I just powered it off [13:09:29] <_joe_> "just in case" [13:10:03] PROBLEM - puppet last run on etcd1002 is CRITICAL puppet fail [13:10:22] PROBLEM - puppet last run on etcd1001 is CRITICAL puppet fail [13:10:42] OH NO [13:10:49] PUPPET FAILED IN HOSTS THAT HAVE BEEN SHUT DOWN [13:10:54] !!! [13:10:55] THE SKY IS FALLING!!1 [13:11:02] nope, actually working [13:11:06] that is a correct puppet fail [13:11:10] oh? [13:11:24] race condition between your OH NO and my actually shutting them down [13:11:43] haha. so why did puppet fail? [13:12:04] RECOVERY - puppet last run on etcd1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:12:13] RECOVERY - puppet last run on etcd1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:12:27] I think it ran before clearing the roles [13:12:49] and failed and I just reenabled it [13:12:55] !sal [13:12:55] https://labsconsole.wikimedia.org/wiki/Server_Admin_Log see it and you will know all you need [13:13:09] !sal del [13:13:09] Successfully removed sal [13:13:29] !sal is https://labsconsole.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. [13:13:30] Key was added [13:13:31] !sal [13:13:31] https://labsconsole.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. [13:15:16] actually... 
[13:15:21] !sal del [13:15:21] Successfully removed sal [13:15:32] ah, labsconsole [13:15:42] !sal is https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. [13:15:42] Key was added [13:15:43] ah yeah that no longer exists :D [13:16:11] not that I 've ever seen anyone use that key [13:16:22] hashar today was the first one for me [13:16:29] I do on #wikimedia-releng [13:16:30] <_joe_> me neither [13:16:34] <_joe_> !wat [13:16:36] https://tools.wmflabs.org/sal/sudo ha [13:16:37] cause I am super lazy [13:17:12] <_joe_> this doesn't work!!1! [13:18:01] http://bots.wmflabs.org/dump/%23wikimedia-operations.htm [13:18:01] @info [13:18:26] !password [13:18:26] gfgjoagaewhgAW#YAU_#YU*U^*^%Q#Tqyhe [13:18:28] https://tools.wmflabs.org/sal/projects [13:18:51] can we get a mixed view of all SALs combined? [13:20:49] (03PS1) 10BBlack: update-ocsp: error output bugfix, no Popen.cmd [puppet] - 10https://gerrit.wikimedia.org/r/233397 [13:21:00] bblack: hi, does sysctl reloading on trusty work via /usr/sbin/service procps start ? [13:21:31] last I heard, I think there are some question marks about that [13:22:03] i tested it today, and it failed most of the time [13:22:08] only most? [13:22:41] i did stop, but failed to start [13:22:45] *it did [13:23:01] it's not a true daemon anyways, so "stop" doesn't even make sense [13:23:14] right [13:23:32] why not just exec sysctl -p ? [13:24:45] I guess because procps is the official way they're applied at bootup? I don't know [13:25:30] i will leave it to andrewbogott to figure out :) [13:25:59] whatever's broken from it has been broken for a long time. 
we might find surprises if we suddenly fix it on a wide class of hosts [13:26:53] (03CR) 10BBlack: [C: 032] update-ocsp: error output bugfix, no Popen.cmd [puppet] - 10https://gerrit.wikimedia.org/r/233397 (owner: 10BBlack) [13:27:06] (03CR) 10Ottomata: [C: 031] Add ferm rules for jmxtrans/impala [puppet] - 10https://gerrit.wikimedia.org/r/229704 (https://phabricator.wikimedia.org/T83597) (owner: 10Muehlenhoff) [13:27:18] (03CR) 10Ottomata: [C: 031] Enable base::firewall on analytics1026 [puppet] - 10https://gerrit.wikimedia.org/r/229705 (owner: 10Muehlenhoff) [13:28:10] 6operations, 6Labs: labs salt master on jessie fails to install salt-master - https://phabricator.wikimedia.org/T110032#1567209 (10fgiunchedi) 3NEW [13:28:45] 7Puppet, 10Continuous-Integration-Config, 5Patch-For-Review: Setup rubocop for operations/puppet ruby code lints - https://phabricator.wikimedia.org/T102020#1567220 (10zeljkofilipin) @hashar: any idea on which folders contain third party code? [13:29:25] zeljkof: no idea :-) check with ops! [13:29:33] (03Abandoned) 10John F. Lewis: fermium: add mapped ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/230239 (owner: 10John F. Lewis) [13:29:56] (03PS2) 10John F. Lewis: lists: add service IPs for lists on fermium [dns] - 10https://gerrit.wikimedia.org/r/233050 [13:30:22] bblack: mind giving https://gerrit.wikimedia.org/r/#/c/233050/ a check over? (when you have time) [13:30:31] well, if anybody can help with this, much appreciated https://phabricator.wikimedia.org/T102020#1567220 [13:30:49] 10Ops-Access-Requests, 6operations, 3Mobile-Content-Service: Add bsitzmann and mholloway as deployers for the MobileApps service - https://phabricator.wikimedia.org/T109855#1567222 (10Mholloway) Pinging @dr0ptp4kt for manager approval. 
[13:30:57] 6operations, 6Labs: labs salt master on jessie fails to install salt-master - https://phabricator.wikimedia.org/T110032#1567224 (10fgiunchedi) [13:36:39] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 2 below the confidence bounds [13:36:48] 7Puppet, 10Continuous-Integration-Config: Move RuboCop job from experimental pipeline to the usual pipelines for operations/puppet - https://phabricator.wikimedia.org/T110019#1567243 (10hashar) p:5Triage>3Normal [13:39:41] 6operations, 6Labs: labs salt master on jessie fails to install salt-master - https://phabricator.wikimedia.org/T110032#1567257 (10fgiunchedi) it seems due to a version mismatch from what gets installed on the image by default (since `grep salt-minion /var/log/dpkg.log*` yields no results) ```lines=15 filippo... [13:42:44] (03PS2) 10Muehlenhoff: Add ferm rules for jmxtrans/impala [puppet] - 10https://gerrit.wikimedia.org/r/229704 (https://phabricator.wikimedia.org/T83597) [13:42:52] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rules for jmxtrans/impala [puppet] - 10https://gerrit.wikimedia.org/r/229704 (https://phabricator.wikimedia.org/T83597) (owner: 10Muehlenhoff) [13:43:56] matanya: what will I figure out? 
[13:45:10] PROBLEM - check_apache2 on payments1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [13:45:10] PROBLEM - check_apache2 on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [13:45:10] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on https://payments1001.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 311 bytes in 0.011 second response time [13:45:11] PROBLEM - check_payments_wiki on payments1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on https://payments1004.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 311 bytes in 0.011 second response time [13:45:19] PROBLEM - check_apache2 on payments1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [13:45:19] PROBLEM - check_payments_wiki on payments1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on https://payments1003.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 311 bytes in 0.011 second response time [13:48:59] PROBLEM - check_apache2 on payments2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [13:50:09] PROBLEM - check_apache2 on payments1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [13:50:09] PROBLEM - check_apache2 on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [13:50:09] PROBLEM - check_payments_wiki on payments1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on https://payments1004.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 311 bytes in 0.011 second response time [13:50:10] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on 
https://payments1001.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 311 bytes in 0.011 second response time [13:50:10] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet has 1 failures [13:50:19] PROBLEM - check_apache2 on payments1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [13:50:19] PROBLEM - check_payments_wiki on payments1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on https://payments1003.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 311 bytes in 0.011 second response time [13:51:21] andrewbogott: well with the timing, what the spam above is :) [13:52:02] (03PS1) 10Ottomata: Include kafka10(13|14|20) in the analytics Kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/233399 [13:53:49] PROBLEM - check_apache2 on payments2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [13:53:49] PROBLEM - check_puppetrun on payments2003 is CRITICAL Puppet has 1 failures [13:53:59] PROBLEM - check_puppetrun on payments2002 is CRITICAL Puppet has 1 failures [13:53:59] PROBLEM - check_apache2 on payments2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [13:54:00] PROBLEM - check_apache2 on payments2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [13:54:21] (03PS1) 10Ottomata: Use require_package instead of ensure_package in role/analytics/hadoop.pp [puppet] - 10https://gerrit.wikimedia.org/r/233400 [13:54:35] 7Blocked-on-Operations, 6operations, 6Services: Migrate SCA cluster to Jessie - https://phabricator.wikimedia.org/T96017#1567306 (10akosiaris) >>! In T96017#1485715, @mobrovac wrote: >>>! In T96017#1485685, @MoritzMuehlenhoff wrote: >> xulrunner is only present in Wheezy, starting with 31, Firefox/Iceweasel... 
[13:54:44] (03PS2) 10Ottomata: Use require_package instead of ensure_package in role/analytics/hadoop.pp [puppet] - 10https://gerrit.wikimedia.org/r/233400 [13:54:50] JohnFLewis: that looks like FR stuff, maybe a Jeff_Green issue? [13:55:03] oh so it is FR :) [13:55:09] PROBLEM - check_apache2 on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [13:55:09] PROBLEM - check_apache2 on payments1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [13:55:10] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on https://payments1001.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 311 bytes in 0.011 second response time [13:55:10] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet has 1 failures [13:55:10] PROBLEM - check_payments_wiki on payments1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on https://payments1004.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 311 bytes in 0.011 second response time [13:55:19] PROBLEM - check_apache2 on payments1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [13:55:19] PROBLEM - check_payments_wiki on payments1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on https://payments1003.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 311 bytes in 0.011 second response time [13:55:41] !log disable puppet on fermium [13:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:55:48] (03CR) 10Ottomata: [C: 032] Use require_package instead of ensure_package in role/analytics/hadoop.pp [puppet] - 10https://gerrit.wikimedia.org/r/233400 (owner: 10Ottomata) [13:55:55] !log disable puppet on fermium preparing for reinstallation [13:56:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:57:05] +1 akosiaris :) [13:57:44] (03PS2) 10Ottomata: Include kafka10(13|14|20) in the analytics Kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/233399 [13:57:47] andrewbogott: can you sms green? [13:57:54] (or someone else from the US) [13:58:05] do we still have any Solaris? [13:58:07] err, Jeff_Green [13:58:17] I will try. Are we sure those are FR things? [13:58:23] andrewbogott: yes, payments**** is [13:58:45] looking [13:58:49] PROBLEM - check_apache2 on payments2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [13:58:49] PROBLEM - check_puppetrun on payments2003 is CRITICAL Puppet has 1 failures [13:58:59] PROBLEM - check_apache2 on payments2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [13:58:59] PROBLEM - check_puppetrun on payments2002 is CRITICAL Puppet has 1 failures [13:59:06] Jeff_Green: thanks :) [13:59:09] PROBLEM - check_apache2 on payments2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [13:59:12] (03CR) 10Ottomata: [C: 032] Include kafka10(13|14|20) in the analytics Kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/233399 (owner: 10Ottomata) [13:59:17] (03PS2) 10BBlack: mobile vcl: tighten up disableImages cookie regex [puppet] - 10https://gerrit.wikimedia.org/r/232945 (https://phabricator.wikimedia.org/T109286) [13:59:41] (03CR) 10BBlack: [C: 032] mobile vcl: tighten up disableImages cookie regex [puppet] - 10https://gerrit.wikimedia.org/r/232945 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [13:59:50] (03CR) 10BBlack: [V: 032] mobile vcl: tighten up disableImages cookie regex [puppet] - 10https://gerrit.wikimedia.org/r/232945 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [14:00:09] PROBLEM - check_apache2 on payments1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [14:00:10] PROBLEM - check_payments_wiki on payments1004 is CRITICAL: HTTP 
CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on https://payments1004.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 311 bytes in 0.011 second response time [14:00:10] PROBLEM - check_apache2 on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [14:00:10] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on https://payments1001.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 311 bytes in 0.011 second response time [14:00:10] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet has 1 failures [14:00:19] PROBLEM - check_apache2 on payments1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [14:00:19] PROBLEM - check_payments_wiki on payments1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on https://payments1003.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 311 bytes in 0.014 second response time [14:00:19] PROBLEM - check_puppetrun on payments1003 is CRITICAL Puppet has 1 failures [14:00:31] (03PS2) 10BBlack: Fix minor spelling mistake [puppet] - 10https://gerrit.wikimedia.org/r/233118 (owner: 10Southparkfan) [14:00:38] (03CR) 10BBlack: [C: 032 V: 032] Fix minor spelling mistake [puppet] - 10https://gerrit.wikimedia.org/r/233118 (owner: 10Southparkfan) [14:00:51] (03PS2) 10BBlack: pybal: switch healthchecks to Special:BlankPage [puppet] - 10https://gerrit.wikimedia.org/r/233053 [14:01:02] 10Ops-Access-Requests, 6operations, 3Mobile-Content-Service: Add bsitzmann and mholloway as deployers for the MobileApps service - https://phabricator.wikimedia.org/T109855#1567337 (10MoritzMuehlenhoff) FWIW, I had written a mail to Toby Negrin this morning asking for manager approval. 
[14:01:24] (03CR) 10BBlack: [C: 032 V: 032] pybal: switch healthchecks to Special:BlankPage [puppet] - 10https://gerrit.wikimedia.org/r/233053 (owner: 10BBlack) [14:01:26] YuviPanda: texted [14:02:05] andrewbogott: he's already here btw :) [14:02:23] well, his bouncer is [14:02:29] jynus: I was told some months Solaris is gone (although there are still wikitech pages referring to it, IIRC) [14:02:38] moritzm, thanks [14:02:47] andrewbogott: 14:58 looking [14:02:58] oh, ok [14:03:03] Jeff_Green: sorry for the double alert :) [14:03:04] (03PS2) 10Muehlenhoff: Enable base::firewall on analytics1026 [puppet] - 10https://gerrit.wikimedia.org/r/229705 [14:03:38] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable base::firewall on analytics1026 [puppet] - 10https://gerrit.wikimedia.org/r/229705 (owner: 10Muehlenhoff) [14:03:43] PROBLEM - check_apache2 on payments2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [14:03:43] PROBLEM - check_puppetrun on payments2003 is CRITICAL Puppet has 1 failures [14:04:03] PROBLEM - check_apache2 on payments2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [14:04:03] PROBLEM - check_puppetrun on payments2002 is CRITICAL Puppet has 1 failures [14:04:03] PROBLEM - check_apache2 on payments2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [14:04:12] 6operations, 10MediaWiki-API, 7HHVM, 7Pywikibot-tests, 7Wikimedia-log-errors: internal_api_error_BadMethodCallException: [xxx] Exception Caught: Call to a member function getNames() on a non-object (NULL) - https://phabricator.wikimedia.org/T109929#1567349 (10Anomie) @Joe: FYI, you should be able to grep... [14:05:01] andrewbogott: no, thank you for the double alert. 
I had stepped away from the keyboard just after causing puppet malfeasance [14:05:13] PROBLEM - check_apache2 on payments1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [14:05:13] PROBLEM - check_apache2 on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [14:05:13] PROBLEM - check_payments_wiki on payments1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on https://payments1004.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 311 bytes in 0.011 second response time [14:05:14] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on https://payments1001.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 311 bytes in 0.011 second response time [14:05:14] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet has 1 failures [14:05:14] PROBLEM - check_apache2 on payments1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [14:05:14] PROBLEM - check_payments_wiki on payments1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on https://payments1002.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 311 bytes in 0.011 second response time [14:05:14] PROBLEM - check_apache2 on payments1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [14:05:15] PROBLEM - check_payments_wiki on payments1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on https://payments1003.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 311 bytes in 0.011 second response time [14:05:15] PROBLEM - check_puppetrun on payments1003 is CRITICAL Puppet has 1 failures [14:08:45] 6operations, 10Wikimedia-Mailing-lists: reinstall fermium with jessie and public IP - https://phabricator.wikimedia.org/T109924#1567379 
(10akosiaris) [14:08:47] (03PS1) 10Andrew Bogott: Install wmf salt version rather than setting up the upstream repo. [puppet] - 10https://gerrit.wikimedia.org/r/233403 (https://phabricator.wikimedia.org/T110036) [14:08:47] 6operations, 10Wikimedia-Mailing-lists: reinstall fermium with public IP - https://phabricator.wikimedia.org/T109890#1567380 (10akosiaris) [14:08:53] PROBLEM - check_apache2 on payments2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [14:08:53] PROBLEM - check_puppetrun on payments2003 is CRITICAL Puppet has 1 failures [14:09:03] PROBLEM - check_apache2 on payments2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [14:09:03] PROBLEM - check_puppetrun on payments2002 is CRITICAL Puppet has 1 failures [14:09:04] PROBLEM - check_apache2 on payments2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [14:09:42] 6operations, 6Labs: labs salt master on jessie fails to install salt-master - https://phabricator.wikimedia.org/T110032#1567388 (10Andrew) [14:10:10] (03PS2) 10Andrew Bogott: Install wmf salt version rather than setting up the upstream repo. 
[puppet] - 10https://gerrit.wikimedia.org/r/233403 (https://phabricator.wikimedia.org/T110032) [14:10:13] RECOVERY - check_apache2 on payments1004 is OK: PROCS OK: 7 processes with command name apache2 [14:10:13] RECOVERY - check_apache2 on payments1002 is OK: PROCS OK: 9 processes with command name apache2 [14:10:13] RECOVERY - check_apache2 on payments1001 is OK: PROCS OK: 9 processes with command name apache2 [14:10:14] RECOVERY - check_payments_wiki on payments1004 is OK: HTTP OK: HTTP/1.1 200 OK - 249 bytes in 0.035 second response time [14:10:14] RECOVERY - check_payments_wiki on payments1002 is OK: HTTP OK: HTTP/1.1 200 OK - 249 bytes in 0.036 second response time [14:10:14] RECOVERY - check_payments_wiki on payments1001 is OK: HTTP OK: HTTP/1.1 200 OK - 249 bytes in 0.036 second response time [14:10:14] RECOVERY - check_puppetrun on payments1001 is OK Puppet is currently enabled, last run 294 seconds ago with 0 failures [14:10:14] PROBLEM - check_apache2 on payments1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [14:10:15] PROBLEM - check_payments_wiki on payments1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on https://payments1003.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 311 bytes in 0.011 second response time [14:10:16] PROBLEM - check_puppetrun on payments1003 is CRITICAL Puppet has 1 failures [14:11:06] (03CR) 10Filippo Giunchedi: [C: 031] Install wmf salt version rather than setting up the upstream repo. 
[puppet] - https://gerrit.wikimedia.org/r/233403 (https://phabricator.wikimedia.org/T110032) (owner: Andrew Bogott)
[14:11:49] something weird is going on - took several minutes for "git deploy sync -- fetch" to pull 2/4, and now its stuck
[14:11:53] (kartotherian)
[14:12:03] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.67% of data above the critical threshold [1000.0]
[14:12:29] ok, 3 min later all 4 synced
[14:12:44] !log git deploy synced kartotherian
[14:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:12:53] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[14:13:43] PROBLEM - check_apache2 on payments2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2
[14:13:43] PROBLEM - check_puppetrun on payments2003 is CRITICAL Puppet has 1 failures
[14:14:03] PROBLEM - check_apache2 on payments2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2
[14:14:03] PROBLEM - check_puppetrun on payments2002 is CRITICAL Puppet has 1 failures
[14:14:04] PROBLEM - check_apache2 on payments2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2
[14:14:04] PROBLEM - check_puppetrun on payments2001 is CRITICAL Puppet has 1 failures
[14:15:13] PROBLEM - check_apache2 on payments1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2
[14:15:13] PROBLEM - check_payments_wiki on payments1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on https://payments1003.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 311 bytes in 0.011 second response time
[14:15:13] PROBLEM - check_puppetrun on payments1003 is CRITICAL Puppet has 1 failures
[14:18:53] RECOVERY - check_apache2 on payments2003 is OK: PROCS OK: 6 processes with command name apache2
[14:18:53] RECOVERY - check_puppetrun on payments2003 is OK Puppet is currently enabled, last run 73 seconds ago with 0 failures
[14:18:55] !log
starting to move kafka topic-partitions to new brokers (and off of analytics1021)
[14:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:19:03] PROBLEM - check_apache2 on payments2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2
[14:19:03] PROBLEM - check_puppetrun on payments2002 is CRITICAL Puppet has 1 failures
[14:19:03] PROBLEM - check_apache2 on payments2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2
[14:19:04] PROBLEM - check_puppetrun on payments2001 is CRITICAL Puppet has 1 failures
[14:20:13] RECOVERY - check_apache2 on payments1003 is OK: PROCS OK: 6 processes with command name apache2
[14:20:13] RECOVERY - check_payments_wiki on payments1003 is OK: HTTP OK: HTTP/1.1 200 OK - 249 bytes in 0.037 second response time
[14:20:13] RECOVERY - check_puppetrun on payments1003 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures
[14:24:03] PROBLEM - check_apache2 on payments2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2
[14:24:03] PROBLEM - check_puppetrun on payments2002 is CRITICAL Puppet has 1 failures
[14:24:03] PROBLEM - check_apache2 on payments2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2
[14:24:03] PROBLEM - check_puppetrun on payments2001 is CRITICAL Puppet has 1 failures
[14:26:22] operations, Labs, Patch-For-Review: labs salt master on jessie fails to install salt-master - https://phabricator.wikimedia.org/T110032#1567462 (fgiunchedi) for already existing instances, as suggested by @valhallasw, this can be fixed via project-wide hiera with `"salt::master::salt_version": 2014.7.5+...
[14:28:15] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL 11.11% of data above the critical threshold [5000000.0]
[14:28:16] ottomata: could you have a look at https://gerrit.wikimedia.org/r/#/c/233068/? And also maybe comment on the state of https://phabricator.wikimedia.org/T108987?
[14:29:03] RECOVERY - check_apache2 on payments2002 is OK: PROCS OK: 6 processes with command name apache2
[14:29:03] RECOVERY - check_puppetrun on payments2002 is OK Puppet is currently enabled, last run 126 seconds ago with 0 failures
[14:29:03] RECOVERY - check_apache2 on payments2001 is OK: PROCS OK: 6 processes with command name apache2
[14:29:04] RECOVERY - check_puppetrun on payments2001 is OK Puppet is currently enabled, last run 149 seconds ago with 0 failures
[14:29:15] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL 11.11% of data above the critical threshold [5000000.0]
[14:30:06] this si ok, i should downtime for htat
[14:30:11] on it
[14:32:08] bblack, hi, is there a way i can do varnish cache purge for a specific url for maps at this point? possibly from command line
[14:38:23] andrewbogott: the sysctl issue on trusty
[14:38:40] matanya: I don’t know what that is
[14:38:49] operations, Performance-Team, Traffic, Performance: Optimize prod's resource domains for SPDY/HTTP2 - https://phabricator.wikimedia.org/T94896#1567536 (Krinkle) Any pending tasks here or is this resolved?
[14:39:02] https://phabricator.wikimedia.org/T109711
[14:39:13] andrewbogott: ^
[14:39:43] oh, that :(
[14:40:05] Yeah, the puppet class just needs an exec I think
[14:40:14] yurik: write a commandline HTCP client?
[14:42:51] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL 33.33% of data above the critical threshold [5000000.0]
[14:43:14] gah, puppet beat me to it. made that new alert before I could schedule downtime
[14:43:31] operations, Performance-Team, Traffic, Performance: Optimize prod's resource domains for SPDY/HTTP2 - https://phabricator.wikimedia.org/T94896#1567576 (BBlack) Well this basically got solved along the way while doing other things. We've flipped back to using a unified cert that covers all the proje...
[14:44:32] operations, Performance-Team, Traffic, Performance: Optimize prod's resource domains for SPDY/HTTP2 - https://phabricator.wikimedia.org/T94896#1567588 (BBlack) I should note that mobile probably still has some coalescing to gain, but it's not the driver for those solutions anyways, which will get ad...
[14:48:06] !log forcing wikitech logouts in order to flush everyone’s service catalog
[14:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:48:11] andrewbogott: just the two of us for the sprint meeting. should we still meet or do you want to just do this async?
[14:48:25] YuviPanda: async is better for me if you don’t mind.
[14:49:03] andrewbogott: +1, let me do an email thread
[14:49:07] (with links)
[14:50:32] andrewbogott: https://phabricator.wikimedia.org/project/board/1456/
[14:52:37] operations, Labs-Sprint-104, Labs-Sprint-105, Labs-Sprint-107, and 5 others: Try to fail over to labnet1002 - https://phabricator.wikimedia.org/T109329#1546058 (Andrew) This is done, but we need to do a bit more research and documentation so we don't forget what we learned in the switchover.
[14:53:23] operations, Labs, Labs-Sprint-107, Labs-Sprint-108, and 3 others: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1567621 (Andrew) Open>Resolved a:Andrew All labvirt hosts are now running 3.16 kernels, and puppet now actively excludes the known-buggy k...
[14:53:35] operations, Labs-Sprint-104, Labs-Sprint-105, Labs-Sprint-107, and 4 others: move nova api to labnet1002 - https://phabricator.wikimedia.org/T109653#1567626 (Andrew) Open>Resolved
[14:53:36] operations, Labs-Sprint-104, Labs-Sprint-105, Labs-Sprint-107, and 5 others: Try to fail over to labnet1002 - https://phabricator.wikimedia.org/T109329#1567627 (Andrew)
[14:53:48] operations, Labs-Sprint-104, Labs-Sprint-105, Labs-Sprint-107, and 4 others: Setup/Install/Deploy labnet1002 - https://phabricator.wikimedia.org/T99701#1567635 (Andrew) stalled>Resolved
[14:55:51] andrewbogott: :D thanks! Weren't you going to do a OpenStack upgrade this wednesday?
[14:56:02] Yep
[14:56:07] need to open a task for that though
[14:56:36] andrewbogott: cool :) any new features we get off just moving to Juno?
[14:56:54] operations, Wikimedia-Mailing-lists: add public IP for fermium - DNS and DHCP change for reinstall - https://phabricator.wikimedia.org/T109923#1567665 (akosiaris)
[14:56:55] operations, Wikimedia-Mailing-lists: reinstall fermium with public IP - https://phabricator.wikimedia.org/T109890#1567664 (akosiaris)
[14:57:12] I’ve barely payed attention to Juno, it’s just a stepping stone to Kilo that gets us some features I really care about — Horizon GUI for DNS and Ceilometer.
[14:57:26] andrewbogott: aaah, sweet :D
[14:57:40] so if the Juno goes well do we have other blockers for Kilo?
[15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150824T1500). Please do the needful.
[15:00:04] Krenair aude: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[15:00:15] No, I’m hoping to do Kilo the following week, presuming there are now surprises.
[15:00:33] coool :)
[15:00:36] aude, want to go first?
[15:00:51] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected
[15:01:05] (PS1) Hashar: elasticsearch: ensure /var/run subdir exists [puppet] - https://gerrit.wikimedia.org/r/233413 (https://phabricator.wikimedia.org/T109497)
[15:02:18] Krenair: ok
[15:02:33] probably will take jenkins some time....
[15:02:44] operations, ops-codfw: ms-be2009 - RAID degraded / failed disk - https://phabricator.wikimedia.org/T107877#1567712 (Papaul) @fgiunchedi Drive replacement complete.
[15:03:37] ori: bblack: I think I finally found why a seemingly unrelated change to mobile mediawiki-config caused load.php load to almost double. Remember that unexplained spike from a few weeks ago in rl stats, which we narrowed down to Ori's config change for wgLoadScript (to use m-domain instead of desktop canonical domain). I think may have have to do with the fact that mobile doesn't stash cookies fo
[15:03:37] r /w/load.php
[15:03:37] https://gerrit.wikimedia.org/r/#/c/232516/4/templates/varnish/text-common.inc.vcl.erb
[15:04:33] (CR) Hashar: [C: 1] "Cherry picked on integration puppet master. That is really just a workaround to prevent puppet from deadlocking while attempting to start " [puppet] - https://gerrit.wikimedia.org/r/233413 (https://phabricator.wikimedia.org/T109497) (owner: Hashar)
[15:05:46] Krinkle: that's not the current code
[15:05:58] bblack: I know, but that's just refactoring right?
[15:06:07] w/load.php was never included there, right?
[15:06:15] It's just the commit that made be see that ode.
[15:06:16] code
[15:06:26] the code you're linking to was a temporary state during related changes about this exact problem. the historical code before last week is completely different (and was wrong, IMHO)
[15:06:27] It's more obvious now, that's all
[15:07:16] bblack: both before and after that commit, mobile did not stash cookies for load.php, correct?
[15:07:49] correct, but now it does
[15:07:55] since when exactly?
[15:08:04] but they way it did it before, there were a lot of differences from how text does things
[15:08:10] before recent changes, it looked like this: https://github.com/wikimedia/operations-puppet/blob/c3919a29c3dbf92c0092227e0de221cf525b26b3/templates/varnish/mobile-frontend.inc.vcl.erb#L67
[15:08:18] Yes
[15:08:42] so, it wasn't stashing cookies for load.php, but it also was failing to even pay attention to Token, was only watching Session
[15:08:50] Yes
[15:09:01] and it still does, just moved to a function evaluate_cookie_mobile in text-common
[15:09:40] (PS1) Alexandros Kosiaris: Assign fermium public IPs. IPv4 and IPv6 [dns] - https://gerrit.wikimedia.org/r/233414 (https://phabricator.wikimedia.org/T109923)
[15:09:41] Krenair: want me to +2?
[15:09:52] Krinkle: https://gerrit.wikimedia.org/r/#/c/232638/4
[15:10:15] ^ is the change that aligned them and stashed for load.php
[15:10:39] aude, my changes?
[15:10:45] are you done?
[15:10:46] so sometime Thursday
[15:10:53] Krenair: my change
[15:10:56] Thursday morning US time I think?
[15:11:24] * aude realized the core submodule updates are automatic when jenkins merges the extension change
[15:11:28] realizes*
[15:11:35] aude, yes, sorry, I thought you were deploying your change
[15:11:41] oh, no
[15:11:49] i can if you want
[15:11:59] * aude gives +2
[15:12:12] and then we wait and you can do your changes while waiting
[15:12:25] I'm not sure about wikibase/composer stuff, so...
[15:13:04] (CR) Alex Monk: [C: 2] Fix noc.wikimedia.org/db.php [mediawiki-config] - https://gerrit.wikimedia.org/r/232920 (https://phabricator.wikimedia.org/T109045) (owner: Alex Monk)
[15:13:09] Krenair: ok
[15:13:12] (CR) jenkins-bot: [V: -1] Fix noc.wikimedia.org/db.php [mediawiki-config] - https://gerrit.wikimedia.org/r/232920 (https://phabricator.wikimedia.org/T109045) (owner: Alex Monk)
[15:13:16] (PS3) John F.
Lewis: lists: add service IPs for lists on fermium [dns] - https://gerrit.wikimedia.org/r/233050
[15:13:33] 15:13:08 stderr: error: object file .git/objects/38/ba7f77676e6afe4b26f8e1aaf9e0e96c26a908 is empty
[15:13:33] 15:13:08 fatal: loose object 38ba7f77676e6afe4b26f8e1aaf9e0e96c26a908 (stored in .git/objects/38/ba7f77676e6afe4b26f8e1aaf9e0e96c26a908) is corrupt
[15:13:35] (PS4) John F. Lewis: lists: add service IPs for lists on fermium [dns] - https://gerrit.wikimedia.org/r/233050
[15:13:54] (CR) John F. Lewis: "modified to 75 as .74 was taken in https://gerrit.wikimedia.org/r/#/c/233414/1 by fermium itself" [dns] - https://gerrit.wikimedia.org/r/233050 (owner: John F. Lewis)
[15:14:03] composer.lock looks sane (https://gerrit.wikimedia.org/r/#/c/233401/ = b607ec094cecf02c937a9ddbe8f48c4c046185f7)
[15:14:19] (CR) Alex Monk: [V: 2] Fix noc.wikimedia.org/db.php [mediawiki-config] - https://gerrit.wikimedia.org/r/232920 (https://phabricator.wikimedia.org/T109045) (owner: Alex Monk)
[15:14:58] !log apt-get upgrade on gallium
[15:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:15:16] (PS3) John F. Lewis: fermium: add service IPs to hiera [puppet] - https://gerrit.wikimedia.org/r/233052
[15:15:28] (PS4) John F. Lewis: fermium: add service IPs to hiera [puppet] - https://gerrit.wikimedia.org/r/233052
[15:16:02] (CR) John F. Lewis: "added fermium's IP and changed the IPv4 for lists following fermium's taking of it." [puppet] - https://gerrit.wikimedia.org/r/233052 (owner: John F.
Lewis)
[15:16:37] !log krenair@tin Synchronized docroot/noc/db.php: https://gerrit.wikimedia.org/r/#/c/232920/ (duration: 01m 34s)
[15:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:16:49] hmm, okay
[15:16:53] new host failures
[15:17:09] PROBLEM - puppet last run on mw2201 is CRITICAL puppet fail
[15:17:10] No space left on device from mw1010
[15:17:31] and connection timed out to mw2180
[15:17:59] :(
[15:18:29] Can't ping mw2180 either
[15:19:49] !log No space left on mw1010, cannot ping or ssh to mw2180
[15:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:20:38] PROBLEM - DPKG on gallium is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:20:53] hashar, that you? ^
[15:21:16] Krenair: yes :(
[15:21:17] (PS1) ArielGlenn: dumps: redo handling of jobs with unrun prereqs [dumps] (ariel) - https://gerrit.wikimedia.org/r/233417
[15:21:54] (CR) ArielGlenn: "untested, no random merges please." [dumps] (ariel) - https://gerrit.wikimedia.org/r/233417 (owner: ArielGlenn)
[15:23:14] (CR) John F. Lewis: [C: 1] "looks good though the second fermium in wikimedia.org is not really needed."
[dns] - https://gerrit.wikimedia.org/r/233414 (https://phabricator.wikimedia.org/T109923) (owner: Alexandros Kosiaris)
[15:24:06] (PS1) ArielGlenn: dumps: tweak stages a bit [puppet] - https://gerrit.wikimedia.org/r/233418
[15:25:02] Krenair: when ready, my patch is merged
[15:25:37] (PS2) Alex Monk: Book namespaces for Urdu Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/232915 (https://phabricator.wikimedia.org/T109505)
[15:25:41] (CR) Alex Monk: [C: 2] Book namespaces for Urdu Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/232915 (https://phabricator.wikimedia.org/T109505) (owner: Alex Monk)
[15:25:46] (CR) jenkins-bot: [V: -1] Book namespaces for Urdu Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/232915 (https://phabricator.wikimedia.org/T109505) (owner: Alex Monk)
[15:25:48] (PS2) Alex Monk: Localise Kannada Wikiquote logo and site name [mediawiki-config] - https://gerrit.wikimedia.org/r/232919 (https://phabricator.wikimedia.org/T104260)
[15:25:53] (CR) jenkins-bot: [V: -1] Localise Kannada Wikiquote logo and site name [mediawiki-config] - https://gerrit.wikimedia.org/r/232919 (https://phabricator.wikimedia.org/T104260) (owner: Alex Monk)
[15:26:17] (CR) Alex Monk: [V: 2] Book namespaces for Urdu Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/232915 (https://phabricator.wikimedia.org/T109505) (owner: Alex Monk)
[15:26:25] yurik: I think tile generation is almost done. can we switch to the system tilerator service instead of the one running from your home dir now ?
[15:26:30] (CR) Alex Monk: [C: 2 V: 2] Localise Kannada Wikiquote logo and site name [mediawiki-config] - https://gerrit.wikimedia.org/r/232919 (https://phabricator.wikimedia.org/T104260) (owner: Alex Monk)
[15:27:19] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0]
[15:29:24] !log krenair@tin Synchronized w/static/images/project-logos/knwikiquote.png: https://gerrit.wikimedia.org/r/#/c/232919/ (duration: 02m 04s)
[15:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:31:25] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/232919/ and https://gerrit.wikimedia.org/r/#/c/232915/ (duration: 01m 34s)
[15:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:33:03] aude, so do you want to sync the wikidata change, or do you want me to do it?
[15:33:13] Krenair: you can
[15:33:19] PROBLEM - puppet last run on gallium is CRITICAL Puppet has 4 failures
[15:34:18] looking at gallium still
[15:34:33] blerg. there are 5 branches worth of l10n on the cluster. that's probably related to mw1010 running out of disk
[15:35:23] aude, so does it matter which order these files are done in? or can we just sync-dir php-1.26wmf19/extensions/Wikidata ?
[15:35:59] The php file change looks fine to me to do on it's own, but I have no clue about the composer parts
[15:36:12] cp: cannot create regular file `/usr/share/python/zuul/bin/python2.7': Text file busy
[15:36:12] yuouuuu
[15:36:47] Krenair: running `scap-purge-l10n-cache --version 1.26wmf15` and repeating for 16, 17, and 18 will free a lot of disk
[15:37:05] Krenair: i always do sync-dir
[15:37:36] twentyafterfour should really be doing that l10n purge as part of the train deploys
[15:37:39] composer.lock is just tells me what version we have of everything
[15:37:39] !log stopped and restarted Zuul
[15:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:38:40] RECOVERY - DPKG on gallium is OK: All packages OK
[15:38:41] !log krenair@tin Synchronized php-1.26wmf19/extensions/Wikidata: https://gerrit.wikimedia.org/r/#/c/233411/1 (duration: 00m 49s)
[15:38:44] bd808, what happens when a scap proxy has a sync error?
[15:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:38:53] thanks
[15:38:58] * aude tries the dispatcher now
[15:39:20] Krenair: potentially bad things as I don't think we remove it from the master list for other syncs :/
[15:39:42] so the bad sync would be propigated
[15:39:46] * bd808 just realized that
[15:39:58] Krenair: was there a problem with sync dir?
[15:40:16] only the same issue as all the other syncs we've done in the last 40 minutes
[15:40:31] mw1010, the machine with no space left, is a scap proxy
[15:41:20] it has no /var partition ... probably logs?
[15:41:35] woah
[15:41:42] !log krenair@tin Purged l10n cache for 1.26wmf15
[15:41:42] @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
[15:41:43] <_joe_> what has happened?
[15:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:41:53] from mw2180
[15:42:16] <_joe_> yeah it hasprobably just been reinstalled
[15:42:24] Krenair: that can happen in a reimage when puppet hasn't copied the new ssh fingerprint over to tin yet
[15:42:30] why is it still in mediawiki-installation? :/
[15:43:09] RECOVERY - puppet last run on mw2201 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures
[15:43:15] _joe_: mw1010 filled up it's root partition. one big log I see is /var/log/mediawiki/jobcron.log
[15:43:33] okay, so that purge feed up some space on mw1010
[15:43:53] <_joe_> bd808: meh
[15:44:04] <_joe_> ok I'm taking a look, I must've missed it
[15:44:06] bd808, wmf16 was used in the last 30 days.. does that matter?
[15:44:23] Krenair: nope. we only need l10n for active branches
[15:44:32] so 19 today
[15:44:42] <_joe_> bd808: still has 2.5 GB though
[15:44:43] *1.26wmf19
[15:44:54] yes, it just gained 2.5GB
[15:44:56] !log krenair@tin Purged l10n cache for 1.26wmf16
[15:45:01] _joe_: yeah, Krenair just freed a bunch of space
[15:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:45:03] because I ran a purge of an old branch's l10n cache
[15:45:28] and now it gained another 2.5GB because I purged another branch's l10n cache
[15:46:10] <_joe_> uhm why is that server so clogged up?
[15:46:15] <_joe_> I need to take a better look
[15:46:23] <_joe_> /var/log is not the problem there
[15:47:23] !log running sync-common on mw1010 to bring it up to date after clearing some space
[15:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:47:28] <_joe_> we just have 25G of logs which is...
ok
[15:51:00] operations, Labs-Sprint-104, Labs-Sprint-105, Labs-Sprint-107, and 5 others: Try to fail over to labnet1002 - https://phabricator.wikimedia.org/T109329#1567871 (Andrew) https://etherpad.wikimedia.org/p/labnet_failover
[15:51:55] <_joe_> bd808: nutcracke 25964 nutcracker 3w REG 8,1 157536476191 11535703 /var/log/nutcracker/nutcracker.log (deleted)
[15:52:28] * bd808 shakes fist in general direction of nutcracker
[15:53:15] If we ever come up with a dedicated nutcracker cluster, we should name them ballet*
[15:53:29] RECOVERY - Disk space on mw1010 is OK: DISK OK
[15:53:41] I was gonna put nutcracker on toollabs
[15:53:56] <_joe_> !log restarted nutcracker on mw1010, holding a 150 GB deleted logfile
[15:53:56] experience in our test cluster (aka production :P) has shown me that might not be the best idea :)
[15:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:54:23] <_joe_> YuviPanda: no it's just that we're using the --logging-ludicrously option
[15:54:41] <_joe_> aka "log at INFO and not at DEATH level"
[15:55:09] --log-level=WHONEEDSDISKSPACE
[15:55:50] operations, ops-codfw: mw2180 has a faulty disk - https://phabricator.wikimedia.org/T109687#1567902 (Papaul) Open>Resolved a:Papaul @Joe Drive replacement complete. System re-imaged complete. System is back up.
[15:56:14] <_joe_> ostriches: a developer would've used --log-leve=DISKSARECHEAPANYWAYS
[15:56:27] <_joe_> papaul: \o/ thanks!
[15:56:31] --log-level=JUSTBUYMOREDISKS
[15:57:18] RECOVERY - puppet last run on gallium is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures
[15:57:25] joe:you welcome
[15:58:59] (PS2) Alexandros Kosiaris: Assign fermium public IPs.
IPv4 and IPv6 [dns] - https://gerrit.wikimedia.org/r/233414 (https://phabricator.wikimedia.org/T109923)
[15:59:04] --log-level=PUSHITTOTHECLOUD
[15:59:15] --daemon --no-log
[16:00:03] operations, ops-codfw: payments2002 looks like it has a failed disk - https://phabricator.wikimedia.org/T105833#1567911 (Papaul) @Jgreen sorry did not see this ticket maybe because it was not assigned to me; but I will be calling HP in the next hour to have them send me a replacement drive and I will keep...
[16:00:17] bd808: --log-level=DROPBOX?
[16:00:40] ostriches: sure! makes things so much easier for the NSA
[16:00:51] saves time all around
[16:01:33] (PS2) Giuseppe Lavagetto: logrotate: make the hhvm error log readable to deployers [puppet] - https://gerrit.wikimedia.org/r/218829 (https://phabricator.wikimedia.org/T78310)
[16:03:30] PROBLEM - puppet last run on gallium is CRITICAL puppet fail
[16:05:09] !log rebooting labnet1001
[16:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:06:41] operations, ops-codfw: payments2002 looks like it has a failed disk - https://phabricator.wikimedia.org/T105833#1567924 (Papaul) @Jgreen was this fixed? I check the lids on the drive i do not see any indicator showing the system has a failed drive. if it was fixed and this ticket was not updated please a...
[16:10:15] Puppet, operations, HHVM, Patch-For-Review: Local hhvm error logs not readable by deployers - https://phabricator.wikimedia.org/T78310#1567927 (Joe) @bd808 with the new patch, the files should be readable to deployers starting tomorrow.
[16:10:26] Puppet, operations, HHVM, Patch-For-Review: Local hhvm error logs not readable by deployers - https://phabricator.wikimedia.org/T78310#1567928 (Joe) Open>Resolved
[16:16:03] operations, ops-codfw: mc2001 not coming up after reboot - https://phabricator.wikimedia.org/T102222#1567943 (Papaul) {F2323716}
[16:16:30] operations, Continuous-Integration-Infrastructure, Multimedia, Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1567947 (brion) @hashar adding notes on T104747 for testing
[16:17:21] operations, MediaWiki-extensions-TimedMediaHandler, Multimedia, HHVM, Patch-For-Review: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1426066 (brion) Per IRC discussion: this is probably ready to throw into production, but we ought to test it on beta cluster fi...
[16:23:16] !log bd808@tin Purged l10n cache for 1.26wmf17
[16:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:23:45] !log bd808@tin Purged l10n cache for 1.26wmf18
[16:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:23:55] <_joe_> the gerrit bot is down
[16:24:09] <_joe_> and jenkins doesn't seem to be much better off
[16:24:49] RECOVERY - Disk space on mw1142 is OK: DISK OK
[16:25:21] (PS1) Andrew Bogott: Include rsync::server before setting up server fragments. [puppet] - https://gerrit.wikimedia.org/r/233428
[16:25:32] <_joe_> bd808: is that you purging caches?
[16:25:41] <_joe_> I guess this is in the same situation as the other host
[16:25:47] (PS2) Andrew Bogott: Include rsync::server before setting up server fragments.
[puppet] - https://gerrit.wikimedia.org/r/233428
[16:25:55] <_joe_> oblivian@restbase1001:~$ sudo puppet agent -tv
[16:25:55] <_joe_> Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'reason not specified');
[16:25:58] <_joe_> Use 'puppet agent --enable' to re-enable.
[16:26:08] <_joe_> who did disable puppet, why, and why with no reason?
[16:26:14] _joe_: yeah I just dumped 2 more. should have freed ~5G I think
[16:26:47] (CR) Andrew Bogott: [C: 2] Include rsync::server before setting up server fragments. [puppet] - https://gerrit.wikimedia.org/r/233428 (owner: Andrew Bogott)
[16:27:09] PROBLEM - puppet last run on labvirt1006 is CRITICAL puppet fail
[16:27:14] <_joe_> seriously, WTF guys
[16:27:33] _joe_: urandom was testing settings last week
[16:27:39] RECOVERY - puppet last run on gallium is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures
[16:27:44] not sure if he disabled puppet for that
[16:28:09] <_joe_> can't we coordinate and use hiera for that?
[16:28:20] <_joe_> I find puppet disabled half of the times on the rb hosts
[16:28:24] <_joe_> that's not okay
[16:28:34] wouldn't there be an entry in the sudo log for it?
[16:29:08] can we configure puppet to puke if you use --disable without a reason?
[16:29:18] RECOVERY - puppet last run on labvirt1006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:29:22] operations, MediaWiki-extensions-TimedMediaHandler, Multimedia, HHVM, Patch-For-Review: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1568012 (brion) Ok, sample files to re-run and test: VP9 source video (currently it won't transcode, but it should work if re-...
[16:29:27] <_joe_> bd808: s/if you use --disable//
[16:29:30] or at least fill in "$USER @ $(date)"
[16:29:34] <_joe_> that's surely very possible
[16:30:09] PROBLEM - puppet last run on labvirt1004 is CRITICAL puppet fail
[16:30:09] operations, Beta-Cluster, MediaWiki-extensions-TimedMediaHandler, Multimedia, and 2 others: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1568015 (brion) Adding beta-cluster project for fixing/updating TMH video scaler job runner for beta cluster...
[16:30:20] <_joe_> so, my point is 1) please don't 2) please input a reason 3) please don't
[16:31:02] <_joe_> (or, just disable puppet if it's really needed and for a short timespan, and if you do so, do it setting a reason)
[16:33:09] _joe_: you are talking to the wrong person
[16:33:30] <_joe_> I'm not talking to you gabriel :) I'm talking to everybody
[16:33:46] okay ;)
[16:34:10] <_joe_> but now try to pretend you didn't ever disable puppet for a weekend with no reason set :P
[16:34:24] <_joe_> (I did it too, it's not ok either )
[16:34:58] no, I have done that too; but, more recently I had some instances where I set a reason *and* logged it in SAL, but still got yelled at
[16:34:59] _joe_: you should just change your attitude, think of it as a game! the game is called: Why is puppet disabled?
[16:35:04] you get to ask 20 yes or no questions.
[16:35:26] <_joe_> ottomata: no my attitude varies between ranting here and just re-enabling it no questions asked :)
[16:35:31] haha
[16:35:32] <_joe_> so this was my "good cop'
[16:35:53] _joe_: in any case, I'll remind urandom when he's around
[16:36:11] haha, bad cop changes the game name to: "WUT THE CRAP WHO REENABLED PUPPET THE WORLD IS CRUMBLING?!"
[16:36:13] <_joe_> I can tell him :)
[16:36:25] <_joe_> ottomata: no bad cop is I re-enable it :)
[16:36:48] exactly, then the game is to figure out who.
PSH you are making this game no fun
[16:37:05] operations, Analytics: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#1568057 (Southparkfan)
[16:37:07] operations, Tracking: Upgrade Wikimedia servers to Ubuntu Trusty (14.04) (tracking) - https://phabricator.wikimedia.org/T65899#1568056 (Southparkfan)
[16:39:20] bad cop is you uninstall puppet and wipe out everything it deployed and !log "Removed all puppetized things from host1001, because apparently nobody wants it running there anyways"
[16:40:33] <_joe_> bblack: that's BOFH
[16:40:59] <_joe_> but point taken, I should do better
[16:41:45] (PS2) Giuseppe Lavagetto: mediawiki: keep 12 weeks of access/error logs at max [puppet] - https://gerrit.wikimedia.org/r/232468
[16:42:07] happy monday, all :)
[16:42:32] (CR) Giuseppe Lavagetto: [C: 2 V: 2] mediawiki: keep 12 weeks of access/error logs at max [puppet] - https://gerrit.wikimedia.org/r/232468 (owner: Giuseppe Lavagetto)
[16:42:48] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL 100.00% of data above the critical threshold [5000000.0]
[16:43:25] operations, ops-codfw: payments2002 looks like it has a failed disk - https://phabricator.wikimedia.org/T105833#1568108 (Jgreen) Open>Resolved a:Jgreen >>! In T105833#1567924, @Papaul wrote: > @Jgreen was this fixed? I check the lids on the drive i do not see any indicator showing the system has...
[16:43:29] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL 100.00% of data above the critical threshold [5000000.0]
[16:45:06] psh its ok, partitions are moving, though i downtimed that...
[16:45:27] thought*
[16:54:31] bd808: yeah I forgot to do that the past few weeks...
[16:54:46] (CR) BBlack: [C: -1] "I tend to think this is a bad idea. Intentionally allowing bad syntax to work is just going to cause pain down the line.
If nothing else" [debs/pybal] - 10https://gerrit.wikimedia.org/r/233043 (owner: 10Ori.livneh) [16:54:56] it's an easy step to skip. we're caught up now [16:54:58] RECOVERY - puppet last run on labvirt1004 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:56:36] I've got it mostly automated, the new schedule is more straight-forward to automate [16:58:08] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1417 bytes in 0.161 second response time [17:01:49] PROBLEM - puppet last run on mw1119 is CRITICAL Puppet has 1 failures [17:02:26] 10Ops-Access-Requests, 6operations: Replace SSH key for mholloway - https://phabricator.wikimedia.org/T110064#1568147 (10Mholloway) 3NEW [17:02:57] 6operations, 10Traffic: Switch codfw caches to tier2, being pushing some traffic through them to test - https://phabricator.wikimedia.org/T110065#1568159 (10BBlack) 3NEW [17:03:23] automate the problem(s) away [17:04:03] 6operations, 10Traffic: Switch codfw caches to tier2, begin pushing some traffic through them to test - https://phabricator.wikimedia.org/T110065#1568176 (10BBlack) [17:04:29] (03PS1) 10Andrew Bogott: Use fqdn for rsync allowed hosts. [puppet] - 10https://gerrit.wikimedia.org/r/233433 (https://phabricator.wikimedia.org/T109902) [17:05:07] (03CR) 10jenkins-bot: [V: 04-1] Use fqdn for rsync allowed hosts. [puppet] - 10https://gerrit.wikimedia.org/r/233433 (https://phabricator.wikimedia.org/T109902) (owner: 10Andrew Bogott) [17:07:06] (03PS2) 10Andrew Bogott: Use fqdn for rsync allowed hosts. [puppet] - 10https://gerrit.wikimedia.org/r/233433 (https://phabricator.wikimedia.org/T109902) [17:07:32] (03CR) 10jenkins-bot: [V: 04-1] Use fqdn for rsync allowed hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/233433 (https://phabricator.wikimedia.org/T109902) (owner: 10Andrew Bogott) [17:07:45] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1568185 (10GWicke) @joe, we probably want *both* a canary deploy *and* a rolling deploy in general. With RB, we tend to deploy to one node... [17:17:13] (03CR) 10Krinkle: [C: 04-1] Add all groups to bast1001, empty bastiononly group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227327 (owner: 10Alex Monk) [17:19:09] (03PS2) 10Mattflaschen: Get rid of $wgFlowOccupyPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228271 (https://phabricator.wikimedia.org/T105574) (owner: 10Matthias Mullie) [17:19:12] !log bouncing Cassandra on restbase1001 to apply temporary GC settings [17:19:16] (03PS5) 10Alex Monk: Add all groups to bast1001, empty bastiononly group [puppet] - 10https://gerrit.wikimedia.org/r/227327 [17:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:19:20] PROBLEM - puppet last run on mw1123 is CRITICAL Puppet last ran 9 days ago [17:21:20] ACKNOWLEDGEMENT - Host cr1-eqdfw is DOWN: CRITICAL - Network Unreachable (208.80.153.198) Faidon Liambotis turnup [17:21:21] ACKNOWLEDGEMENT - Host cr1-eqord is DOWN: CRITICAL - Network Unreachable (208.80.154.198) Faidon Liambotis turnup [17:21:26] bblack, where should i broadcast the htcp to? [17:22:32] (03PS3) 10Andrew Bogott: Use fqdn for rsync allowed hosts. 
[puppet] - https://gerrit.wikimedia.org/r/233433 (https://phabricator.wikimedia.org/T109902) [17:23:40] yurik: we haven't assigned or set up a new multicast address for maps, which we should do before we go configuring this and using it [17:23:50] RECOVERY - puppet last run on mw1123 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [17:23:59] operations: finish and automate data retention scripts - https://phabricator.wikimedia.org/T110066#1568234 (ArielGlenn) NEW a:ArielGlenn [17:24:00] it's non-trivial, but I'd rather not have experimental purges flowing to the primary caches either [17:24:14] (PS4) Andrew Bogott: Use fqdn for rsync allowed hosts. [puppet] - https://gerrit.wikimedia.org/r/233433 (https://phabricator.wikimedia.org/T109902) [17:24:17] bblack, understood, is there any way for me to purge one specific page today? [17:24:37] operations: finish and automate data retention scripts - https://phabricator.wikimedia.org/T110066#1568243 (ArielGlenn) [17:24:39] not easily, no. is there some critical abnormal need for it? [17:24:56] (CR) Andrew Bogott: [C: 2] Use fqdn for rsync allowed hosts. [puppet] - https://gerrit.wikimedia.org/r/233433 (https://phabricator.wikimedia.org/T109902) (owner: Andrew Bogott) [17:25:25] file a phab ticket for the URL that needs purging I guess [17:25:32] bblack, it's for today's demo - https://maps.wikimedia.org/osm/pbfinfo.json is stale, was hoping to invalidate it [17:25:43] i could do another deploy and rename it to something [17:26:03] i can't cash bust it because stupid software does not allow query params [17:26:07] *cache [17:26:14] or you could make the changes more than 24h before the demo, since the cache is limited to 1 day! :P [17:26:34] anyways, make a ticket so I don't forget. I have things to do and a meeting coming up... [17:26:37] bblack, that's against every principle i hold dear!
[17:26:58] its ok, don't worry about it, will do an extra depl [17:27:09] RECOVERY - puppet last run on mw1119 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:29:25] 6operations, 10Wikimedia-Mailing-lists: reinstall fermium with public IP - https://phabricator.wikimedia.org/T109890#1568269 (10Dzahn) Yes, we have tested what we wanted to test on fermium. You can go ahead, thanks for taking it. [17:31:51] (03PS1) 10BBlack: Switch codfw to tier2 [puppet] - 10https://gerrit.wikimedia.org/r/233438 (https://phabricator.wikimedia.org/T110065) [17:35:11] (03PS1) 10Cscott: Always use VRS to configure Visual Editor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233439 [17:37:52] (03PS1) 10Andrew Bogott: Allow nova user to rsync as root. [puppet] - 10https://gerrit.wikimedia.org/r/233440 (https://phabricator.wikimedia.org/T109902) [17:42:06] (03PS5) 10EBernhardson: Fix mwgrep to work without dynamic scripting [puppet] - 10https://gerrit.wikimedia.org/r/232193 (https://phabricator.wikimedia.org/T108151) [17:43:32] (03PS5) 10Filippo Giunchedi: cassandra: WIP support for multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253) [17:50:22] 6operations, 10ops-codfw: ms-be2006 failed disk - https://phabricator.wikimedia.org/T108340#1568364 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi done ``` /dev/sdg1 1.9T 1.1T 811G 57% /srv/swift-storage/sdg1 /dev/sdl1 1.9T 1.1T 827G 56% /srv/swift-storage/sdl1 ``` [17:51:18] (03CR) 10Cscott: "No, not really!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220061 (owner: 10Cscott) [17:51:47] (03Abandoned) 10Cscott: Set $wgVisualEditorParsoidDomain for Parsoid v2 API. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/220061 (owner: 10Cscott) [17:54:39] RECOVERY - puppet last run on ms-be2009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:59:39] (03PS1) 10Muehlenhoff: Add hiera data for swift proxies and backends [puppet] - 10https://gerrit.wikimedia.org/r/233443 (https://phabricator.wikimedia.org/T104965) [18:00:23] (03PS2) 10Andrew Bogott: Allow nova user to rsync as root. [puppet] - 10https://gerrit.wikimedia.org/r/233440 (https://phabricator.wikimedia.org/T109902) [18:02:50] (03PS1) 10ArielGlenn: update code for collection of minions with given grain [software] - 10https://gerrit.wikimedia.org/r/233445 [18:02:52] (03PS1) 10ArielGlenn: check mysqlconf for log retention and skip explicit logfiles check [software] - 10https://gerrit.wikimedia.org/r/233446 [18:02:54] (03PS1) 10ArielGlenn: allow for more than one mysql conf file, audit all of /a [software] - 10https://gerrit.wikimedia.org/r/233447 [18:02:56] (03PS1) 10ArielGlenn: break some files out into modules, add rulestore util [software] - 10https://gerrit.wikimedia.org/r/233448 [18:02:58] (03PS1) 10ArielGlenn: turn retention modules into a package [software] - 10https://gerrit.wikimedia.org/r/233449 [18:03:00] (03PS1) 10ArielGlenn: remove dup copy of Runner class; read confs from disk, not stdin [software] - 10https://gerrit.wikimedia.org/r/233450 [18:03:02] (03PS1) 10ArielGlenn: write config files to subdir in the salt file_root [software] - 10https://gerrit.wikimedia.org/r/233451 [18:03:04] (03PS1) 10ArielGlenn: use cp.get_file to retrieve configs [software] - 10https://gerrit.wikimedia.org/r/233452 [18:03:06] (03PS1) 10ArielGlenn: turn files auditor into salt module [software] - 10https://gerrit.wikimedia.org/r/233453 [18:03:08] (03PS1) 10ArielGlenn: logs audit via salt module [software] - 10https://gerrit.wikimedia.org/r/233454 [18:03:10] (03PS1) 10ArielGlenn: homes audit done via salt module [software] - 
10https://gerrit.wikimedia.org/r/233455 [18:03:12] (03PS1) 10ArielGlenn: remove now unused auditor.py, move static methods into module [software] - 10https://gerrit.wikimedia.org/r/233456 [18:03:14] (03PS1) 10ArielGlenn: move rule static methods out to a seperate utils file [software] - 10https://gerrit.wikimedia.org/r/233457 [18:03:16] (03PS1) 10ArielGlenn: get rid of executor and related vars for most files [software] - 10https://gerrit.wikimedia.org/r/233458 [18:03:18] (03PS1) 10ArielGlenn: file examiner: use salt module for remote cmd [software] - 10https://gerrit.wikimedia.org/r/233459 [18:03:20] (03PS1) 10ArielGlenn: convert dir examiner to use salt module [software] - 10https://gerrit.wikimedia.org/r/233460 [18:03:22] (03PS1) 10ArielGlenn: last 'executor' code removed, all salt module now [software] - 10https://gerrit.wikimedia.org/r/233461 [18:03:24] (03PS1) 10ArielGlenn: move cli static methods out to separate file [software] - 10https://gerrit.wikimedia.org/r/233462 [18:03:26] (03PS1) 10ArielGlenn: move readline completion and ignored entry lists out to separate files [software] - 10https://gerrit.wikimedia.org/r/233463 [18:03:28] (03PS1) 10ArielGlenn: cleanup unused function, unneeded special class instantiations [software] - 10https://gerrit.wikimedia.org/r/233464 [18:03:30] (03PS1) 10ArielGlenn: clean up ignore list code [software] - 10https://gerrit.wikimedia.org/r/233465 [18:03:32] (03PS1) 10ArielGlenn: use yaml instead of python for global config, ignore list [software] - 10https://gerrit.wikimedia.org/r/233466 [18:03:34] (03PS1) 10ArielGlenn: don't modify sys.path, convert remaining execs into local and remote [software] - 10https://gerrit.wikimedia.org/r/233467 [18:03:36] (03PS1) 10ArielGlenn: clean up names and paths, remove last runpy calls, use more yaml [software] - 10https://gerrit.wikimedia.org/r/233468 [18:03:38] (03PS1) 10ArielGlenn: enable directory prompt in cli in a few places in the menu [software] - 
10https://gerrit.wikimedia.org/r/233469 [18:03:40] (03PS1) 10ArielGlenn: some pylint fixes [software] - 10https://gerrit.wikimedia.org/r/233470 [18:03:42] (03PS1) 10ArielGlenn: remove old audit_files.py, no longer needed, remove unused Rule class remove unused timeout from local audit methods remove some dead attrs and args, clean up handling of ignore dicts [software] - 10https://gerrit.wikimedia.org/r/233471 [18:03:44] (03PS1) 10ArielGlenn: more pylint/pep8 cleanup [software] - 10https://gerrit.wikimedia.org/r/233472 [18:03:46] (03PS1) 10ArielGlenn: more pylint: move dup code for file ignore check into its own method take care of unused imports in borrowed code for 'magic' more pylint and misc cleanups [software] - 10https://gerrit.wikimedia.org/r/233473 [18:03:48] (03PS1) 10ArielGlenn: pep8 fixup: remove extra spaces pylint cleanup: work around no method overrides in python [software] - 10https://gerrit.wikimedia.org/r/233474 [18:03:50] (03PS1) 10ArielGlenn: fix up for install via setup.py, bug fixes [software] - 10https://gerrit.wikimedia.org/r/233475 [18:03:52] (03PS1) 10ArielGlenn: bit more pylint [software] - 10https://gerrit.wikimedia.org/r/233476 [18:03:54] (03PS1) 10ArielGlenn: retention emails fixes: [software] - 10https://gerrit.wikimedia.org/r/233477 [18:03:56] (03PS1) 10ArielGlenn: retention: split setup.py into two files, for clients and master [software] - 10https://gerrit.wikimedia.org/r/233478 [18:04:01] what... [18:04:25] * apergos runs [18:04:38] sorry, clearing out a backlog I'd been sitting on for a long tmie [18:04:58] cause I was chicken [18:07:06] fortunately now you are rabbit [18:07:40] apergos: I see like 10 lint patches, could have merged into one eventually :P [18:07:44] go forth and multiply? 
[18:07:53] I squashed a bunch already [18:08:02] and how many lint patches can you have in one before it's not readable [18:10:17] (03PS2) 10Yuvipanda: Revert "base: ensure => absent on 'command-not-found'" [puppet] - 10https://gerrit.wikimedia.org/r/233156 (owner: 10Alex Monk) [18:10:51] eh true :) [18:12:37] (03CR) 10Faidon Liambotis: "What's the rationale for reverting?" [puppet] - 10https://gerrit.wikimedia.org/r/233156 (owner: 10Alex Monk) [18:12:47] (03CR) 10Faidon Liambotis: [C: 04-1] Revert "base: ensure => absent on 'command-not-found'" [puppet] - 10https://gerrit.wikimedia.org/r/233156 (owner: 10Alex Monk) [18:13:20] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I was wondering about that too. I kind of hate command-not-found too." [puppet] - 10https://gerrit.wikimedia.org/r/233156 (owner: 10Alex Monk) [18:13:48] (03CR) 10Yuvipanda: [C: 031] "- Affects labs, no labs users were asked" [puppet] - 10https://gerrit.wikimedia.org/r/233156 (owner: 10Alex Monk) [18:16:57] 6operations, 10Security-Reviews: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#1568517 (10Moushira) @Dzahn, please check: http://wikimedia.limeservice.org.. [18:18:53] lol [18:19:54] (03CR) 10Giuseppe Lavagetto: "Just to get things straight:" [puppet] - 10https://gerrit.wikimedia.org/r/233156 (owner: 10Alex Monk) [18:20:48] (03CR) 10Faidon Liambotis: "We don't typically ask Labs users for such things -- never have and probably never will unless we diverge Labs from production significant" [puppet] - 10https://gerrit.wikimedia.org/r/233156 (owner: 10Alex Monk) [18:20:57] (03CR) 10Alex Monk: "Given that it actively prevents the package from being installed, you really should have at least asked." 
[puppet] - 10https://gerrit.wikimedia.org/r/233156 (owner: 10Alex Monk) [18:21:23] (03CR) 10Dzahn: "Ori said that it is also using port 6379 for outgoing connections because there is already an unrelated redis on tin needed for trebuchet " [puppet] - 10https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970) (owner: 10Dzahn) [18:21:34] (03CR) 10Yuvipanda: "Sigh. It's 'just labs'." [puppet] - 10https://gerrit.wikimedia.org/r/233156 (owner: 10Alex Monk) [18:21:57] what do you mean, YuviPanda? [18:22:33] we didn't poll employees with shell accounts in prod either [18:22:46] (03CR) 10Dzahn: "and he also said that both of the rulesets actually apply to nutcracker, not redis, so we should merge the 4 rules into 2 rules listing bo" [puppet] - 10https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970) (owner: 10Dzahn) [18:22:58] heh [18:23:18] this clearly needs an RfC! [18:25:41] bblack: https://www.mediawiki.org/wiki/Requests_for_comment/Command_not_found:_Is_it_or_isn%27t_it [18:28:07] this may deserve special mention somewhere as the most canonical example of time wasted bikeshedding a trivial issue on ops/puppet I've ever seen :P [18:28:23] 6operations, 10hardware-requests, 7Database: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1568576 (10RobH) a:5jcrespo>3Cmjohnson It shows delivered on 2015-08-20. https://rt.wikimedia.org/Ticket/Display.html?id=9524 has not been updated, even though tracking shows delivere... [18:31:07] !log reloading backup LVS pybals for BlankPage change ( https://gerrit.wikimedia.org/r/#/c/233053/ ) [18:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:33:03] akosiaris any update on fermium? 
:), [18:38:54] 6operations, 10Analytics: Increase HADOOP_HEAPSIZE (-Xmx) for hive-server2 - https://phabricator.wikimedia.org/T76343#1568625 (10Ottomata) [18:44:32] 6operations, 7Pybal: Configure pybal ulimits higher - https://phabricator.wikimedia.org/T110091#1568652 (10BBlack) 3NEW [18:45:11] <_joe_> bblack: and you missed the live discussion between me and yuvi [18:45:14] <_joe_> :P [18:46:35] 6operations, 10Wikimedia-Mailing-lists: reinstall fermium with public IP - https://phabricator.wikimedia.org/T109890#1568671 (10Dzahn) Actually, no, there was one thing that i still had to copy. the data that Tim uploaded of the old staff list for T109395. Can we save that first please? [18:49:38] (03PS1) 10BBlack: pybal: raise open files ulimit to 10240 [puppet] - 10https://gerrit.wikimedia.org/r/233484 (https://phabricator.wikimedia.org/T110091) [18:50:06] (03PS2) 10BBlack: pybal: raise open files ulimit to 10240 [puppet] - 10https://gerrit.wikimedia.org/r/233484 (https://phabricator.wikimedia.org/T110091) [18:50:29] (03CR) 10BBlack: [C: 032 V: 032] pybal: raise open files ulimit to 10240 [puppet] - 10https://gerrit.wikimedia.org/r/233484 (https://phabricator.wikimedia.org/T110091) (owner: 10BBlack) [18:50:43] (03PS4) 10Dzahn: enable codfw bastion for non-ops user groups [puppet] - 10https://gerrit.wikimedia.org/r/222519 [18:53:03] (03PS1) 10Andrew Bogott: Allow labcontrol1001 rsync access to the labvirt nodes. [puppet] - 10https://gerrit.wikimedia.org/r/233485 [18:53:44] (03CR) 10Dzahn: [C: 032] enable codfw bastion for non-ops user groups [puppet] - 10https://gerrit.wikimedia.org/r/222519 (owner: 10Dzahn) [18:53:48] 6operations, 10Wikimedia-Mailing-lists: reinstall fermium with public IP - https://phabricator.wikimedia.org/T109890#1568752 (10akosiaris) sure, go ahead. 
I haven't yet reinstalled the box [18:53:52] (03PS3) 10Dzahn: add parsoid/ocg/bastiononly user groups to hooft [puppet] - 10https://gerrit.wikimedia.org/r/222522 [18:54:37] (03CR) 10Dzahn: [C: 032] add parsoid/ocg/bastiononly user groups to hooft [puppet] - 10https://gerrit.wikimedia.org/r/222522 (owner: 10Dzahn) [18:58:06] (03PS6) 10Muehlenhoff: ferm rules for nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970) (owner: 10Dzahn) [18:58:29] is etherpad down? [18:58:31] !log reloading primary LVS pybals for BlankPage change ( https://gerrit.wikimedia.org/r/#/c/233053/ ) + ulimit fixup ( https://gerrit.wikimedia.org/r/#/c/233484/ ) [18:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:59:53] niedzielski: not for me, I got a 503 earlier today though [19:00:07] niedzielski: also, in 1 minute all of the opsen will be using it, so they'll see if it's down then :) [19:00:15] greg-g: ah, back now! [19:04:07] niedzielski: it's flakey for this meeting of about 16 or so [19:04:19] greg-g: seems that way :/ [19:05:20] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [19:06:45] (03PS2) 10Ori.livneh: Add HttpConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/232941 (owner: 10Giuseppe Lavagetto) [19:10:38] Krenair: changes merged that make the bastion host accounts more consistent [19:10:54] (03PS2) 10Andrew Bogott: Allow labcontrol1001 rsync access to the labvirt nodes. [puppet] - 10https://gerrit.wikimedia.org/r/233485 [19:11:29] (03PS2) 10Chad: Assign swift roles via ENC [puppet] - 10https://gerrit.wikimedia.org/r/200625 (https://phabricator.wikimedia.org/T91553) (owner: 10Thcipriani) [19:12:01] (03CR) 10Andrew Bogott: [C: 032] Allow labcontrol1001 rsync access to the labvirt nodes. [puppet] - 10https://gerrit.wikimedia.org/r/233485 (owner: 10Andrew Bogott) [19:12:27] Someone mind looking at 200625? 
Pretty trivial, only affects staging where it's already running (and has been for months) [19:12:55] JohnFLewis: yes. mutante asked to copy some files off first before reinstalling [19:13:05] so I am on waiting mode [19:13:21] akosiaris: yeah saw, we discussed that briefly :) [19:15:40] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [19:18:20] akosiaris: copied right now. done [19:18:55] 6operations, 10Wikimedia-Mailing-lists: reinstall fermium with public IP - https://phabricator.wikimedia.org/T109890#1568921 (10Dzahn) copied and done. it can be reinstalled now [19:20:35] (03PS1) 10Alexandros Kosiaris: backups: remove the temporary helium nfs if clause [puppet] - 10https://gerrit.wikimedia.org/r/233489 [19:21:55] (03CR) 10Alexandros Kosiaris: [C: 032] backups: remove the temporary helium nfs if clause [puppet] - 10https://gerrit.wikimedia.org/r/233489 (owner: 10Alexandros Kosiaris) [19:23:49] PROBLEM - puppet last run on helium is CRITICAL Puppet has 2 failures [19:25:19] RECOVERY - bacula director process on helium is OK: PROCS OK: 1 process with UID = 110 (bacula), command name bacula-dir [19:25:50] RECOVERY - puppet last run on helium is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [19:33:06] (03CR) 10Dduvall: "I like the approach that uses separate sockets for a few reasons:" [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: 10Thcipriani) [19:37:01] (03PS1) 10Matanya: access: update key for Mholloway [puppet] - 10https://gerrit.wikimedia.org/r/233582 [19:40:19] (03PS1) 10Alexandros Kosiaris: Add ip6 address on helium [puppet] - 10https://gerrit.wikimedia.org/r/233584 [19:40:35] (03CR) 10Mholloway: [C: 031] access: update key for Mholloway [puppet] - 10https://gerrit.wikimedia.org/r/233582 (owner: 10Matanya) [19:45:06] (03PS1) 10Ori.livneh: Make FileConfigurationObserver easier to extend [debs/pybal] - 
10https://gerrit.wikimedia.org/r/233586 [19:46:56] etherpad.wikimedia.org giving me "Error: 503, Service Unavailable at Mon, 24 Aug 2015 19:46:11 GMT " [19:46:57] (03PS3) 10Ori.livneh: Add HttpConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/232941 (owner: 10Giuseppe Lavagetto) [19:47:00] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL - Socket timeout after 10 seconds [19:48:08] An etherpad reload gave 'SyntaxError: expected expression, got ')' in https://etherpad.wikimedia.org/javascripts/lib/ep_etherpad-lite/static/js/ace2_common.js?callback=require.define at line 1'' [19:49:24] etherpad briefly back up, then down again [19:50:59] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 0.005 second response time [19:51:24] 6operations, 6Labs, 10Labs-Infrastructure: disk space on labvirt1007 - https://phabricator.wikimedia.org/T109752#1568982 (10hashar) [19:52:50] (03PS4) 10Ori.livneh: Add HttpConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/232941 (owner: 10Giuseppe Lavagetto) [19:53:22] <_joe_> ori: /win 16 [19:53:26] <_joe_> err [19:59:20] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK Less than 1.00% above the threshold [1000000.0] [20:00:04] gwicke cscott arlolra subbu: Respected human, time to deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150824T2000). Please do the needful. [20:01:37] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM" [debs/pybal] - 10https://gerrit.wikimedia.org/r/233586 (owner: 10Ori.livneh) [20:03:01] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 28.57% of data above the critical threshold [500.0] [20:03:21] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 1 below the confidence bounds [20:08:24] what's with the 5xx? 
[20:08:32] jouncebot: next [20:08:32] In 0 hour(s) and 51 minute(s): OAuth (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150824T2100) [20:08:47] https://gdash.wikimedia.org/dashboards/reqerror/ [20:09:08] I guess 1K isn't insane for a short spike, but that's a rather wide spike [20:09:13] bblack, how hard is it to have a "dummy" value in the middle of the URL, removable by varnish? e.g. /osm//z/x/y.png [20:09:42] why do you want that? [20:09:49] this way we could track the usage of the maps by different apps [20:10:00] by what apps? [20:10:21] users would implement different labs-based utils that use maps [20:10:30] and we can track which app uses what [20:10:38] (for KPI) [20:10:45] can they simply set a request-header that we can send on to hadoop via X-Analytics? [20:10:52] <_joe_> yurik_: there is a feature called "referer" in HTTP [20:10:59] <_joe_> :) [20:11:00] or that! [20:11:19] hmm, you are right, referrer is there too :) But, in case it is used on the same server? [20:11:31] referer is not about the server, it's about the url [20:11:32] as for request header - it depends if the 3rd party libs support that [20:11:52] yurik_: or user agent? [20:12:02] user agent will always be the browser [20:12:14] ottomata: except these are probably browser-based "apps", so we don't want to try to clobber the browser's own UA (if that's even possible) [20:12:15] <_joe_> ottomata: he didn't mean "apps" in that sense [20:12:25] its not an app, (although it could be in theory), its the different labs-based tools [20:12:46] <_joe_> so a referer is the correct answer :) [20:12:47] the idea was to encourage everyone to set headers, otherwise it won't work :) [20:12:50] ayek [20:13:02] oh, btw, we now have static service too! [20:13:02] in any case, as a general rule I'd prefer we look for ways to do this that don't much with the URL space, so that varnish complexity stays low. [20:13:20] s/much/muck/ [20:14:43] bblack, i hear you, will pass it on. 
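The Referer suggestion above can be sketched roughly as follows, assuming per-tool usage is counted from request logs instead of from a dummy URL segment. The helper name `tool_from_referer`, the sample log entries, and the `tools.example.org` hosts are all hypothetical, and nothing here reflects how the actual X-Analytics or Hadoop pipeline works. Using the host plus the first path segment is one possible answer to the "same server" question, since the Referer carries the full URL and not just the server.

```python
from collections import Counter
from urllib.parse import urlsplit


def tool_from_referer(referer):
    """Return a label for the calling tool, derived from the Referer header.

    Hypothetical helper: combines the host with the first path segment, so
    several tools hosted on one shared server can still be told apart.
    """
    if not referer:
        return "unknown"
    parts = urlsplit(referer)
    if not parts.netloc:
        return "unknown"
    host = parts.netloc.lower()
    segments = [s for s in parts.path.split("/") if s]
    return f"{host}/{segments[0]}" if segments else host


# Made-up (tile path, Referer) pairs standing in for parsed request logs.
# The tile URLs stay untouched, so the cache is not split per tool.
requests = [
    ("/osm/6/33/22.png", "https://tools.example.org/mapviewer/index.html"),
    ("/osm/6/33/23.png", "https://tools.example.org/mapviewer/index.html"),
    ("/osm/7/66/44.png", "https://tools.example.org/geohack/?x=1"),
    ("/osm/7/66/45.png", None),
]

usage = Counter(tool_from_referer(ref) for _path, ref in requests)
for tool, count in usage.most_common():
    print(tool, count)
```

Because the attribution comes entirely from a header, every tool requests the same tile URLs and the cache layer needs no rewrite rules for it.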
[20:14:44] check out http://ns512621.ip-167-114-156.net/osm/6/44.8247/4.9981/1000/600.png [20:14:51] basically it's an image service! [20:15:03] you set the center and image size and zoom [20:15:24] neat [20:16:11] as long as we're talking about general "please don't do this to varnish" topics: don't commit code that intentionally cache-busts via query-params, it's a bad anti-pattern [20:16:40] we can figure out what to really vary on, or use cache-control headers, to handle cases where cache-busting query params seem desirable at first glance. [20:18:01] PROBLEM - salt-minion processes on kafka1022 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [20:18:08] yurik_: also another consideration re: maps caching and TTLs and purging: we're not making an s-maxage distinction between internal and external caches right now, like we do for wiki text. [20:18:29] (so caches outside of our control that can't be purged would also be able to hold things for a day, in theory) [20:20:27] bblack, yes, agree about cache busting - that's why i was thinking we could do the URL normalization in varnish (excluding today's minor thing - it's a minor one-off). I think one day should be good enough, plus I generate ETag, which might make it easier for other use cases [20:21:20] !log updated Parsoid to version 0b2fbae7 [20:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:22:46] bblack, we could keep one day for external, and keep longer internally and purge by index [20:23:06] this is a different discussion, i will keep pondering on it [20:23:15] re: "normalization": it's better if the app simply uses a properly normalized URL scheme to begin with [20:23:38] yep, will do [20:23:38] if some "foo" doesn't actually affect rendered content and shouldn't impact/split the cache, then "foo" shouldn't be part of the URL [20:24:06] those analytics ppl!
always want it difficult :) [20:24:36] most of our analytics doesn't affect the URL. we parse other headers and encode the necessary bits into X-Analytics [20:27:10] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [20:34:01] RECOVERY - salt-minion processes on kafka1022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:35:05] bblack, do you know why maps died an hour ago? https://maps.wikimedia.org/static [20:35:12] the backend is still up [20:36:43] it's probably because I phoned the datacenter and told them to cut the power cords to those machines with scissors [20:36:52] i knew it! [20:37:06] Error: 503, Service Unavailable at Mon, 24 Aug 2015 20:36:48 GMT [20:37:13] are you sure the backend that the caches hit is still working? [20:39:24] (03CR) 10Ori.livneh: [C: 032] Make FileConfigurationObserver easier to extend [debs/pybal] - 10https://gerrit.wikimedia.org/r/233586 (owner: 10Ori.livneh) [20:39:30] bblack, i just did ssh -L 6001:localhost:4000 maps-test2003.codfw.wmnet -- and a few others - they are all ok [20:39:40] if i browse to localhost:6001 [20:39:41] (03Merged) 10jenkins-bot: Make FileConfigurationObserver easier to extend [debs/pybal] - 10https://gerrit.wikimedia.org/r/233586 (owner: 10Ori.livneh) [20:40:11] 18 FetchError c no backend connection [20:41:23] bblack, what server are you hitting? [20:41:32] root@cp1043:/etc/varnish# curl -vI http://kartotherian.svc.codfw.wmnet:4000/ [20:41:35] * Hostname was NOT found in DNS cache [20:41:37] * Trying 10.2.1.13... [20:41:40] * connect to 10.2.1.13 port 4000 failed: Connection refused [20:42:52] they all work as maps-test200x, so something with the LVS service there? [20:43:14] bblack, no idea what that ip is -- it resolves as kartotherian.svc.codfw.wmnet -- i guess its lvs [20:43:28] i have no access to lvs [20:43:43] lvs just routes to the defined backends [20:44:15] I see puppet is disabled, can I turn it back on? 
[20:44:22] bblack, what server? [20:44:27] maps-test2001 [20:44:31] it should be on [20:44:34] i think :) [20:44:43] it's been off for a very long time :P [20:44:54] bblack, check with akosiaris [20:45:12] he is the one who set it up, maybe there is reason? the 2001 has redis [20:46:15] and i suspect all other 200{2-4} are ok for puppets [20:47:02] I'm looking at other side-issues now [20:51:30] 2015-08-24 20:49:50.900565 [kartotherian_4000] Could not load configuration URL http://config-master.codfw.wmnet/pybal/codfw/kartotherian: 404 Not Found [20:52:08] I think that went into effect because of an earlier pybal config change, which was pending from some point in the past, and then hit when I restarted LVSes for other things... [20:53:19] or something of that nature, anyways [20:54:09] actually, I don't even see kartotherian on config-master at all, like it's never been defined [20:54:15] perhaps it was manually-defined at some point? [20:54:30] or it was never committed and someone reset it in the git repo [20:56:33] yurik_: should be fixed now [20:56:42] bblack, thanks! what was it? [20:57:20] like I said above, there was no LVS config URL at all, to list the maps-test200x hosts as the LVS backends for kartotherian.svc.codfw.wmnet [20:57:46] it must have been running in some uncommitted/unofficial way previously, and then it died when I pushed an unrelated config change -> restart of the relevant pybal daemon [20:58:06] (a couple of hours ago) [20:58:13] gotcha [20:58:16] thx!!! 
[20:58:46] sorry i didn't see your comments above [20:58:58] 6operations, 10MediaWiki-File-management, 6Multimedia: Thumbnail render throttling should not result in HTTP 500 - https://phabricator.wikimedia.org/T110109#1569124 (10Tgr) 3NEW [21:03:48] (03PS1) 10Andrew Bogott: use_chroot => no so that we can preserve file ownership [puppet] - 10https://gerrit.wikimedia.org/r/233599 [21:03:55] (03CR) 10jenkins-bot: [V: 04-1] use_chroot => no so that we can preserve file ownership [puppet] - 10https://gerrit.wikimedia.org/r/233599 (owner: 10Andrew Bogott) [21:05:00] yurik_: while you're here, is now a good time to ask what we're going to do about map label languages? :) [21:05:18] i'm always here :D [21:05:26] just not connected well :) [21:05:46] bblack, we have several options: the image way and the vector way [21:06:03] the image way - transparencies on top of the maps [21:06:14] well I assume you'd want to keep the vector ones singular, and render multiple pngs (one per lang) from a given source vector [21:06:40] this way i think they will be smaller [21:06:53] leaflet can do multile layers [21:06:57] *multiple [21:07:07] where you just say put this on top of that [21:07:08] you mean you're going to strip the labels from the base tiles, and put them in an overlay PNG that's separate? 
[21:07:13] yep [21:07:17] one method [21:07:31] or possibly alternative technology altogether, like what wikiminiatlas is doing [21:07:36] which i think draws it in client [21:07:45] ok [21:07:51] and we could hack it to show the base layer without labels at all [21:08:04] if that actually works and doesn't have significant caveats, it would probably be better [21:08:33] vs a future where the count of images the caches have to deal with is num_tiles * ($wikilangs + 1) heh [21:08:33] and lastly - the vector way would be to have all langs in a single tile, and dynamically decide what to show [21:09:04] the vector way being make the selection at png render time, and put the language code in the image URL pathname, right? [21:09:19] no, as in send the vector tile to the client [21:09:21] webgl [21:09:29] and use js to decide what to show [21:09:36] oh sure, but I was under the impression we can't rely on vector support in clients yet right? [21:09:43] correct :( [21:09:55] will need to check with mapbox - they were very actively working on that [21:10:01] so that's out. or at least, it would need a fallback, so we still have to solve the other problem anyways [21:10:22] plus there is an issue of actually adding data - they don't support hstore in C++ mapnik :( [21:10:32] drawing in the client is probably not an option for those older clients either [21:10:41] as for fallbacks - we could always reduce the number of langs for the server-side [21:10:53] e.g. say that we will only show N langs [21:10:58] main ones [21:11:04] whatever that N is [21:11:32] it should be much easier now - no need to support russian as they plan to block WP [21:12:04] a) that's not a great topic to bring up here and b) that doesn't impact us on a technical decision level :P [21:12:23] yeah i know, was making a sad joke [21:12:37] springle, wanted to ping you about https://phabricator.wikimedia.org/T94427 . 
I did not realize at that time (sorry), but it turns out there are four wikis that have Echo tables on the standard cluster. Those also need the change applied. I believe that is causing https://phabricator.wikimedia.org/T107835 which is causing notifications not to be delivered. [21:13:03] ooooooh [21:13:07] So the job queue thing was a red herring? [21:13:11] RoanKattouw, yes, AFAICT [21:13:20] bblack, so anyway, MaxSem and I were thinking about this for a while, i think our #1 deliverable is "international" (english+local) [21:13:22] I mean it is failing in the job queue, but it would either way AFAICT. [21:13:55] yurik_: meaning two languages total, one is like what you currently have for "international", and the other is all english? [21:14:15] bblack, no, as in longer labels that include both names [21:14:19] ah [21:14:44] simple first, complex later :) [21:14:56] (03PS2) 10Andrew Bogott: use_chroot => no so that we can preserve file ownership [puppet] - 10https://gerrit.wikimedia.org/r/233599 [21:14:56] that makes sense as a baseline choice I think, even though someone will probably say it's english-ist :P [21:14:58] (03PS1) 10Andrew Bogott: Add sus-migrate, yet another attempt to support cold migration. [puppet] - 10https://gerrit.wikimedia.org/r/233602 (https://phabricator.wikimedia.org/T109902) [21:15:09] I personally think that we should ever serve only 2 things: default layer with localized languages as PNG and multilingual PBFs. multilingual PNGs aren't worth the effort [21:15:28] yurik_, bblack ^^^ [21:15:43] i think we should also include non-labeled one [21:15:51] (or scarcely labeled one) [21:15:58] PBFs means vector rendered by the browser? 
[21:16:01] yep [21:16:05] ok [21:16:28] it would be an interesting datapoint to know which browser revs can't do vector and what kind of wmf client percentages they make up [21:16:30] i suspect that it won't really be that much effort to tell you the truth [21:16:46] anything that has IE in it [21:17:12] well, a full solution in the PNGs-only world has downside of multiplying the total dataset size for the caches [21:17:41] IE is a lot heh [21:17:44] bblack, i think that even though the "theoretical" number of combinations is huge, the practical size will be manageable [21:17:46] (03CR) 10Andrew Bogott: [C: 032] use_chroot => no so that we can preserve file ownership [puppet] - 10https://gerrit.wikimedia.org/r/233599 (owner: 10Andrew Bogott) [21:17:47] not even 11 or Edge? [21:18:11] no idea really, will need to research. I think anything that has a reasonable webgl support will do [21:18:44] i think that even if we do do all the languages, the real data size there will be much smaller [21:18:45] google says 11 has webgl [21:19:06] re: data size, possibly, but still, it is a multiplier [21:19:10] IE is 12% of PVs [21:20:15] I would say vectors would work in 60-70% of cases [21:20:23] I wish we could do a patch that would remotely exploit -> upgrade IE[78]/XP users to some other browser :P [21:20:42] bblack, format c: && install linux [21:20:57] for XP users, you can reliably deliver only coup de grace, I'm afraid [21:21:26] MaxSem, i will show you when you are done with the CR - it is really easy to dynamically set up the styling to generate multiple languages. The bigger problem is getting all the langs into the vectors [21:21:33] curl http://freecode.com/projects/debtakeover [21:21:50] yurik_, installing Linux will not work because nothing will work without Bliss.jpg [21:22:36] MaxSem, finish up the CR and lets do the KT! 
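The cache-size concern raised above (num_tiles * ($wikilangs + 1)) can be put into rough numbers. The following is a minimal sketch, assuming a standard slippy-map pyramid with 2^z x 2^z tiles at zoom z; that tiling scheme, the function names, and the sample zoom and language counts are illustrative assumptions, not figures from the discussion.

```python
def tiles_at_zoom(zoom: int) -> int:
    """Tile count at one zoom level, assuming a 2^z x 2^z grid (assumption)."""
    return 4 ** zoom


def total_tiles(max_zoom: int) -> int:
    """Tile count for all zoom levels from 0 through max_zoom inclusive."""
    return sum(tiles_at_zoom(z) for z in range(max_zoom + 1))


def cached_images(num_tiles: int, num_langs: int) -> int:
    """Upper bound from the chat: num_tiles * (num_langs + 1).

    This is the theoretical ceiling for per-language PNGs; as noted in the
    discussion, the set of tiles actually requested may be much smaller.
    """
    return num_tiles * (num_langs + 1)


if __name__ == "__main__":
    base = total_tiles(2)  # hypothetical tiny pyramid: zoom levels 0, 1, 2
    print(base, cached_images(base, 3))
```

The point of the sketch is only that the language count acts as a straight multiplier on whatever the base tile count is, which is why reducing the server-side set to N main languages was suggested as a fallback.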
[21:22:42] :-P [21:23:07] I wonder how controversial it would be to do a banner campaign that only targets IE[78]/XP UAs, with a message that tries to encourage them to either upgrade their Windows OS, or install one of several alternate browsers (with links to, say, FF, Chrome, and Opera) [21:24:50] bblack, sure! and while we are at it, we could promote some political party, etc :) [21:24:58] i'm sure it will be highly targeted :) [21:25:13] yeah I know we don't want to get into propping up any one side in the browser wars [21:25:37] lynx! [21:25:39] but IE[78]/XP is legitimately a horrible option for any user, and MS doesn't support the platform or offer an upgrade path without upgrading the whole OS [21:25:51] "even Microsoft warned you 5 years ago" http://apcmag.com/microsoft-warns-stop-using-ie6-ie7-now.htm/ [21:25:59] and it's an edge case that holds us back (and everyone else on the internet too) [21:26:19] mutante keeps coming up with all those fun links! lurking forevah! [21:27:15] i think google tried that a few times, didn't they? [21:28:20] i'm still very happy with the static service :) maps.wikimedia.org/osm/6/44.8247/4.9981/1000/600.png [21:28:45] only took 20 min to implement thanks to a npm lib! [21:29:29] that combined with a graph should be very cool! [21:30:20] where in the world are you: https://maps.wikimedia.org/osm/6/48.8247/8.9981/10/10.png :P [21:32:21] (03CR) 10Ori.livneh: [C: 032] Add HttpConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/232941 (owner: 10Giuseppe Lavagetto) [21:32:35] (03Merged) 10jenkins-bot: Add HttpConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/232941 (owner: 10Giuseppe Lavagetto) [21:36:22] yurik_: bblack puppet is disabled because of the long running tilerator jobs from yurik_'s homedir. yurik_ is that done ? can we please migrate to the system one ? 
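The two static-map URLs pasted above share one path shape. The following is a small sketch of building such a URL, assuming the segments mean zoom, latitude, longitude, width, and height in that order; that reading is inferred from the two examples only and is not confirmed in the discussion, and the helper name is hypothetical.

```python
# Base path taken from the URLs in the chat.
BASE = "https://maps.wikimedia.org/osm"


def static_map_url(zoom, lat, lon, width, height, base=BASE):
    """Build a static map image URL.

    Assumed segment order: /{zoom}/{lat}/{lon}/{width}/{height}.png
    (inferred from the example URLs, not from any documented API).
    """
    return f"{base}/{zoom}/{lat}/{lon}/{width}/{height}.png"


if __name__ == "__main__":
    print(static_map_url(6, 44.8247, 4.9981, 1000, 600))
    print(static_map_url(6, 48.8247, 8.9981, 10, 10))
```

Under that assumed reading, the second example asks for a 10x10 pixel image, which would explain the "where in the world are you" joke.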
[21:37:33] akosiaris, if it won't wipe redis, should be ok [21:37:49] it kinda stalled now, i will need to restart a few things, but should be ok [21:38:03] yurik_: it wont. ok then. I 'll stop the one from your homedir and start the system one then [21:38:26] akosiaris, sure [21:38:47] (03PS2) 10Andrew Bogott: Add sus-migrate, yet another attempt to support cold migration. [puppet] - 10https://gerrit.wikimedia.org/r/233602 (https://phabricator.wikimedia.org/T109902) [21:38:55] akosiaris, have you enabled "disable" func? [21:38:58] i still need that [21:39:07] because sometimes i need to pause the execution [21:40:00] (03Abandoned) 10Andrew Bogott: Allow nova user to rsync as root. [puppet] - 10https://gerrit.wikimedia.org/r/233440 (https://phabricator.wikimedia.org/T109902) (owner: 10Andrew Bogott) [21:40:49] (03PS1) 10Gergő Tisza: Set OAuth readoonly on Beta for DB migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233606 (https://phabricator.wikimedia.org/T108648) [21:40:51] (03PS1) 10Gergő Tisza: Change OAuth central wiki on Beta to metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233607 (https://phabricator.wikimedia.org/T108648) [21:40:53] (03PS1) 10Gergő Tisza: Set OAuth readonly for DB migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233608 (https://phabricator.wikimedia.org/T108648) [21:40:55] (03PS1) 10Gergő Tisza: Change OAuth central wiki to metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233609 (https://phabricator.wikimedia.org/T108648) [21:40:57] (03PS1) 10Gergő Tisza: End OAuth migration; reenable writes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233610 (https://phabricator.wikimedia.org/T108648) [21:41:14] yurik_: it's what I wanted to enable and was waiting for the jobs to finish [21:41:51] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK Less than 1.00% above the threshold [1000000.0] [21:41:59] !log enabled puppet on maps-test200{1,2,3,4}.codfw.wmnet [21:42:07] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:43:51] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [21:44:41] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK Less than 1.00% above the threshold [1000000.0] [21:44:51] (03PS2) 10BBlack: Switch codfw to tier2 [puppet] - 10https://gerrit.wikimedia.org/r/233438 (https://phabricator.wikimedia.org/T110065) [21:44:52] (03PS1) 10BBlack: Disable IPSec monitoring temporarily [puppet] - 10https://gerrit.wikimedia.org/r/233616 [21:45:30] akosiaris, enabling tilerator is good, but i will need a way to disable the service once in a while, and run it manually [21:45:35] (03PS1) 10Dzahn: mailman: add cronjob to delete old held messages [puppet] - 10https://gerrit.wikimedia.org/r/233617 (https://phabricator.wikimedia.org/T109838) [21:45:50] (03PS2) 10BBlack: Disable IPSec monitoring temporarily [puppet] - 10https://gerrit.wikimedia.org/r/233616 (https://phabricator.wikimedia.org/T110065) [21:45:52] (03PS3) 10BBlack: Switch codfw to tier2 [puppet] - 10https://gerrit.wikimedia.org/r/233438 (https://phabricator.wikimedia.org/T110065) [21:46:24] (03CR) 10jenkins-bot: [V: 04-1] mailman: add cronjob to delete old held messages [puppet] - 10https://gerrit.wikimedia.org/r/233617 (https://phabricator.wikimedia.org/T109838) (owner: 10Dzahn) [21:49:12] (03CR) 10John F. 
Lewis: [C: 04-1] mailman: add cronjob to delete old held messages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/233617 (https://phabricator.wikimedia.org/T109838) (owner: 10Dzahn) [21:52:09] (03CR) 10Gergő Tisza: [C: 032] Set OAuth readoonly on Beta for DB migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233606 (https://phabricator.wikimedia.org/T108648) (owner: 10Gergő Tisza) [21:52:15] (03Merged) 10jenkins-bot: Set OAuth readoonly on Beta for DB migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233606 (https://phabricator.wikimedia.org/T108648) (owner: 10Gergő Tisza) [21:54:50] !log tgr@tin Synchronized wmf-config/CommonSettings-labs.php: set beta OAuth to readonly (duration: 00m 13s) [21:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:55:07] !log tgr@tin Synchronized wmf-config/CommonSettings-labs.php: set beta OAuth to readonly (duration: 00m 13s) [21:56:34] I'm getting some nasty warnings about how mw2180.codfw.wmnet has a different key [21:56:48] sync-file is failing because of it [21:56:48] (03PS5) 10Dzahn: fermium: add service IPs to hiera [puppet] - 10https://gerrit.wikimedia.org/r/233052 (owner: 10John F. Lewis) [21:57:33] tgr, yeah, was known earlier [21:57:45] it was reinstalled [21:57:57] (03CR) 10Dzahn: [C: 032] fermium: add service IPs to hiera [puppet] - 10https://gerrit.wikimedia.org/r/233052 (owner: 10John F. Lewis) [21:58:26] doesn't seem to be set up properly yet because I can't login to it, it asks for a password [22:00:20] (03PS1) 10Andrew Bogott: Turn on DiskFilter for nova scheduler. 
[puppet] - 10https://gerrit.wikimedia.org/r/233619 [22:01:42] (03PS1) 10Alexandros Kosiaris: Enable mask/umask of tilerator and kartotherian services [puppet] - 10https://gerrit.wikimedia.org/r/233620 (https://phabricator.wikimedia.org/T106637) [22:02:00] 6operations: determine nik everett's shell/production access levels - https://phabricator.wikimedia.org/T109390#1569279 (10EBernhardson) wikidata-query-roots: Kill statistics-privatedata-users: Kill logstash-roots: Necessary udp2log-users: kill? elasticsearch-roots: Necessary deployment: Necessary? Tbh, i'm not... [22:02:58] (03PS3) 10Andrew Bogott: Add sus-migrate, yet another attempt to support cold migration. [puppet] - 10https://gerrit.wikimedia.org/r/233602 (https://phabricator.wikimedia.org/T109902) [22:03:39] (03PS2) 10Andrew Bogott: Turn on DiskFilter for nova scheduler. [puppet] - 10https://gerrit.wikimedia.org/r/233619 [22:04:03] (03CR) 10Andrew Bogott: [C: 032] Add sus-migrate, yet another attempt to support cold migration. [puppet] - 10https://gerrit.wikimedia.org/r/233602 (https://phabricator.wikimedia.org/T109902) (owner: 10Andrew Bogott) [22:04:27] Krenair: does it serve traffic, or can I just ignore it? [22:04:38] tgr, nope, it's codfw so no user traffic [22:04:44] internal access only AFAIK [22:05:26] (03CR) 10Andrew Bogott: [C: 032] Turn on DiskFilter for nova scheduler. [puppet] - 10https://gerrit.wikimedia.org/r/233619 (owner: 10Andrew Bogott) [22:07:47] (03PS1) 10Andrew Bogott: Rename some migrate scripts to better identify what they do [puppet] - 10https://gerrit.wikimedia.org/r/233621 [22:08:04] (03PS2) 10Andrew Bogott: Rename some migrate scripts to better identify what they do [puppet] - 10https://gerrit.wikimedia.org/r/233621 [22:08:48] (03CR) 10Dzahn: [C: 032] lists: add service IPs for lists on fermium [dns] - 10https://gerrit.wikimedia.org/r/233050 (owner: 10John F. 
Lewis) [22:08:50] (03PS3) 10Andrew Bogott: Rename some migrate scripts to better identify what they do [puppet] - 10https://gerrit.wikimedia.org/r/233621 [22:09:01] (03PS4) 10Andrew Bogott: Rename some migrate scripts to better identify what they do [puppet] - 10https://gerrit.wikimedia.org/r/233621 [22:09:05] (03PS1) 10Alexandros Kosiaris: maps: Add CREATE grant to tilerator on cassandra [puppet] - 10https://gerrit.wikimedia.org/r/233622 [22:10:51] PROBLEM - puppet last run on labcontrol1001 is CRITICAL Puppet has 1 failures [22:13:28] (03CR) 10Andrew Bogott: [C: 032] Rename some migrate scripts to better identify what they do [puppet] - 10https://gerrit.wikimedia.org/r/233621 (owner: 10Andrew Bogott) [22:14:51] RECOVERY - puppet last run on labcontrol1001 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures [22:15:00] (03PS1) 10Alexandros Kosiaris: base::service_unit: ship systemd units in /lib [puppet] - 10https://gerrit.wikimedia.org/r/233626 [22:15:19] (03CR) 10Gergő Tisza: [C: 032] Change OAuth central wiki on Beta to metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233607 (https://phabricator.wikimedia.org/T108648) (owner: 10Gergő Tisza) [22:15:25] (03Merged) 10jenkins-bot: Change OAuth central wiki on Beta to metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233607 (https://phabricator.wikimedia.org/T108648) (owner: 10Gergő Tisza) [22:16:40] !log tgr@tin Synchronized wmf-config/CommonSettings-labs.php: change OAuth DB on beta +enable writes (duration: 00m 12s) [22:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:17:04] 6operations, 10Beta-Cluster, 7Database: Possible to run writes (e.g. UPDATE) on slave - https://phabricator.wikimedia.org/T110115#1569333 (10Mattflaschen) 3NEW [22:17:13] 6operations, 10Beta-Cluster, 7Database: Possible to run writes (e.g. 
UPDATE) on slave - https://phabricator.wikimedia.org/T110115#1569342 (10Mattflaschen) [22:20:22] (03PS1) 10Alexandros Kosiaris: maps: enable the ganglia diskstat plugin [puppet] - 10https://gerrit.wikimedia.org/r/233627 [22:20:51] (03PS1) 10Andrew Bogott: Invite a few more virt hosts to the DiskFilter party. [puppet] - 10https://gerrit.wikimedia.org/r/233628 [22:21:03] (03PS2) 10Dzahn: mailman: add cronjob to delete old held messages [puppet] - 10https://gerrit.wikimedia.org/r/233617 (https://phabricator.wikimedia.org/T109838) [22:23:12] (03CR) 10Gergő Tisza: [C: 032] Set OAuth readonly for DB migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233608 (https://phabricator.wikimedia.org/T108648) (owner: 10Gergő Tisza) [22:23:18] (03CR) 10jenkins-bot: [V: 04-1] Set OAuth readonly for DB migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233608 (https://phabricator.wikimedia.org/T108648) (owner: 10Gergő Tisza) [22:23:31] (03PS2) 10Alexandros Kosiaris: maps: Add CREATE grant to tilerator on cassandra [puppet] - 10https://gerrit.wikimedia.org/r/233622 [22:23:37] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] maps: Add CREATE grant to tilerator on cassandra [puppet] - 10https://gerrit.wikimedia.org/r/233622 (owner: 10Alexandros Kosiaris) [22:23:52] (03PS2) 10Alexandros Kosiaris: maps: enable the ganglia diskstat plugin [puppet] - 10https://gerrit.wikimedia.org/r/233627 [22:23:58] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] maps: enable the ganglia diskstat plugin [puppet] - 10https://gerrit.wikimedia.org/r/233627 (owner: 10Alexandros Kosiaris) [22:24:04] (03CR) 10Gergő Tisza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233608 (https://phabricator.wikimedia.org/T108648) (owner: 10Gergő Tisza) [22:25:01] (03CR) 10Gergő Tisza: "Looks like jenkins breakage." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/233608 (https://phabricator.wikimedia.org/T108648) (owner: 10Gergő Tisza) [22:25:29] (03CR) 10Gergő Tisza: [C: 032] Set OAuth readonly for DB migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233608 (https://phabricator.wikimedia.org/T108648) (owner: 10Gergő Tisza) [22:25:34] (03CR) 10jenkins-bot: [V: 04-1] Set OAuth readonly for DB migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233608 (https://phabricator.wikimedia.org/T108648) (owner: 10Gergő Tisza) [22:25:57] 6operations, 10Beta-Cluster, 7Database: Possible to run writes (e.g. UPDATE) on slave - https://phabricator.wikimedia.org/T110115#1569373 (10Krenair) All the `sql` command was changed to do was connect to a server returned by `wfGetLB()->getServerName(wfGetLB()->getReaderIndex())` by default instead of `wfGe... [22:26:10] (03PS2) 10Andrew Bogott: Invite a few more virt hosts to the DiskFilter party. [puppet] - 10https://gerrit.wikimedia.org/r/233628 [22:26:47] (03PS3) 10Dzahn: mailman: add cronjob to delete old held messages [puppet] - 10https://gerrit.wikimedia.org/r/233617 (https://phabricator.wikimedia.org/T109838) [22:27:06] (03CR) 10Dzahn: [C: 032] mailman: add cronjob to delete old held messages [puppet] - 10https://gerrit.wikimedia.org/r/233617 (https://phabricator.wikimedia.org/T109838) (owner: 10Dzahn) [22:27:43] (03CR) 10Andrew Bogott: [C: 032] Invite a few more virt hosts to the DiskFilter party. [puppet] - 10https://gerrit.wikimedia.org/r/233628 (owner: 10Andrew Bogott) [22:28:14] (03PS3) 10Andrew Bogott: Invite a few more virt hosts to the DiskFilter party. [puppet] - 10https://gerrit.wikimedia.org/r/233628 [22:28:36] (03CR) 10Gergő Tisza: [C: 04-1] "I'll leave this part to another deploy window." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233608 (https://phabricator.wikimedia.org/T108648) (owner: 10Gergő Tisza) [22:28:40] 6operations, 10Beta-Cluster, 7Database: Possible to run writes (e.g. 
UPDATE) on slave - https://phabricator.wikimedia.org/T110115#1569386 (10Mattflaschen) >>! In T110115#1569373, @Krenair wrote: > Isn't that a production analytics host? What does it have to do with anything? There are also slaves of the wik... [22:32:19] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: clean up mailman data directory (moderated messages > 0.5 million) - https://phabricator.wikimedia.org/T109838#1569408 (10Dzahn) sent a mail to the list of listadmins and announced a plan to delete everything automatically that is older than 90 days... [22:35:01] PROBLEM - Disk space on labcontrol1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=93%) [22:35:28] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: clean up mailman data directory (moderated messages > 0.5 million) - https://phabricator.wikimedia.org/T109838#1569418 (10Dzahn) [22:37:07] andrewbogott: ^ labcontrol... [22:37:33] bah, of course [22:38:00] I'm on labcontrol1001 now if you need any hlep [22:40:15] no, it’s just that... [22:40:29] well, to avoid miscellaneous cursed permission issues the new migrate script migrates via labcontrol1001 [22:40:36] so, things have a tendency to pile up in /tmp [22:42:52] RECOVERY - Disk space on labcontrol1001 is OK: DISK OK [22:45:39] 10Ops-Access-Requests, 6operations, 3Mobile-Content-Service: Add bsitzmann and mholloway as deployers for the MobileApps service - https://phabricator.wikimedia.org/T109855#1569456 (10dr0ptp4kt) Approved. [22:52:59] (03PS6) 10Alex Monk: Add all groups to non-ops bastions, empty bastiononly group [puppet] - 10https://gerrit.wikimedia.org/r/227327 [23:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150824T2300). Please do the needful. [23:00:04] matt_flaschen: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[23:00:44] matt_flaschen, want to do that or do you want me to? [23:00:45] Present [23:00:56] I can [23:01:17] Krenair, I'm still working on something else, so I'll see how that goes, and either add that, or just do the original one at the end of the window. [23:05:20] PROBLEM - HHVM rendering on mw2070 is CRITICAL - Socket timeout after 10 seconds [23:07:10] RECOVERY - HHVM rendering on mw2070 is OK: HTTP OK: HTTP/1.1 200 OK - 68397 bytes in 0.746 second response time [23:10:02] (03PS2) 10GWicke: Require openjdk-8-jdk [puppet] - 10https://gerrit.wikimedia.org/r/222037 [23:15:43] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Replace SSH key for mholloway - https://phabricator.wikimedia.org/T110064#1569540 (10Dzahn) @mholloway Could you login on phabricator with your (WMF) wiki account? On first login that will be associated with this Phabricator account and will confirm it's... [23:26:04] 6operations, 10Wikimedia-Mailing-lists: rsync all configs and archives one more time - https://phabricator.wikimedia.org/T110129#1569573 (10Dzahn) [23:26:11] 6operations, 10Wikimedia-Mailing-lists: rsync all configs and archives one more time - https://phabricator.wikimedia.org/T110129#1569576 (10Dzahn) p:5Triage>3High [23:26:26] 6operations, 10Wikimedia-Mailing-lists: rsync all configs and archives one more time - https://phabricator.wikimedia.org/T110129#1569556 (10Dzahn) [23:26:28] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1569577 (10Dzahn) [23:27:08] 6operations, 10Wikimedia-Mailing-lists: import all lists with the script we wrote for that - https://phabricator.wikimedia.org/T110131#1569579 (10Dzahn) 3NEW a:3Dzahn [23:31:28] 6operations, 10Wikimedia-Mailing-lists: lower lists.wikimedia.org TTL to 5 min - https://phabricator.wikimedia.org/T110132#1569590 (10Dzahn) 3NEW a:3Dzahn [23:31:36] (03PS2) 10Dzahn: lists: lower A[AAA] records to 5M [dns] - 
10https://gerrit.wikimedia.org/r/233049 (https://phabricator.wikimedia.org/T110132) (owner: 10John F. Lewis) [23:32:42] 6operations, 10Wikimedia-Mailing-lists: announce scheduled downtime - https://phabricator.wikimedia.org/T110133#1569600 (10Dzahn) 3NEW a:3Dzahn [23:33:59] 6operations, 10Wikimedia-Mailing-lists: right before the switch: lower TTL to 10 seconds - https://phabricator.wikimedia.org/T110135#1569616 (10Dzahn) 3NEW a:3Dzahn [23:36:10] (03PS1) 10Alex Monk: Remove Annex namespace from arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233636 (https://phabricator.wikimedia.org/T98896) [23:36:35] (03CR) 10Alex Monk: [C: 032] Remove Annex namespace from arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233636 (https://phabricator.wikimedia.org/T98896) (owner: 10Alex Monk) [23:36:40] (03CR) 10jenkins-bot: [V: 04-1] Remove Annex namespace from arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233636 (https://phabricator.wikimedia.org/T98896) (owner: 10Alex Monk) [23:36:44] (03PS1) 10Dzahn: lists: lower TTL to 10 seconds [dns] - 10https://gerrit.wikimedia.org/r/233637 (https://phabricator.wikimedia.org/T110135) [23:37:29] (03CR) 10Alex Monk: [V: 032] Remove Annex namespace from arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233636 (https://phabricator.wikimedia.org/T98896) (owner: 10Alex Monk) [23:38:23] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/233636/ (duration: 00m 12s) [23:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:39:19] (03PS1) 10Alex Monk: Remove mw2180 from mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/233638 [23:40:35] 6operations, 10Wikimedia-Mailing-lists: hold lists.wikimedia.org with exim - https://phabricator.wikimedia.org/T110136#1569634 (10Dzahn) 3NEW a:3Dzahn [23:41:48] Going to do the SWAT now. 
[23:41:57] (03CR) 10Mattflaschen: [C: 032] Get rid of $wgFlowOccupyPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228271 (https://phabricator.wikimedia.org/T105574) (owner: 10Matthias Mullie) [23:42:28] (03Merged) 10jenkins-bot: Get rid of $wgFlowOccupyPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228271 (https://phabricator.wikimedia.org/T105574) (owner: 10Matthias Mullie) [23:43:51] 6operations, 10Wikimedia-Mailing-lists: shut down mailman on sodium - https://phabricator.wikimedia.org/T110137#1569643 (10Dzahn) 3NEW a:3Dzahn [23:44:29] 6operations, 10Wikimedia-Mailing-lists: hold lists.wikimedia.org with exim - https://phabricator.wikimedia.org/T110136#1569650 (10Dzahn) a:5Dzahn>3JohnLewis per IRC talk: we need more details how to stop and merge the queues etc [23:45:13] 6operations, 10Wikimedia-Mailing-lists: Ensure mailman VM setup has adequate entropy for STARTTLS - https://phabricator.wikimedia.org/T109239#1569652 (10Dzahn) cajoel gave hardware RNGs to Faidon and Mark, not sure if we'll need them yet [23:45:59] !log mattflaschen@tin Synchronized wmf-config: Remove wgFlowOccupyPages (duration: 00m 12s) [23:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:47:27] (03CR) 10Mattflaschen: [C: 031] ""ECDSA host key for mw2180.codfw.wmnet has changed and you have requested strict checking."" [puppet] - 10https://gerrit.wikimedia.org/r/233638 (owner: 10Alex Monk) [23:47:31] 6operations, 10Wikimedia-Mailing-lists: Ensure mailman VM setup has adequate entropy for STARTTLS - https://phabricator.wikimedia.org/T109239#1569654 (10Dzahn) a:5Dzahn>3faidon [23:47:58] 6operations, 10Wikimedia-Mailing-lists: Ensure mailman VM setup has adequate entropy for STARTTLS - https://phabricator.wikimedia.org/T109239#1543725 (10Dzahn) [23:48:00] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1569655 (10Dzahn) [23:50:45] 
6operations, 10Wikimedia-Mailing-lists: rsync the diff since mail was held on sodium - https://phabricator.wikimedia.org/T110138#1569660 (10Dzahn) 3NEW a:3Dzahn [23:53:05] 6operations, 10Wikimedia-Mailing-lists: switch over mailman service IP - https://phabricator.wikimedia.org/T110139#1569669 (10Dzahn) 3NEW a:3Dzahn [23:53:23] (03PS1) 10Dzahn: switch lists IP from sodium to fermium [dns] - 10https://gerrit.wikimedia.org/r/233642 (https://phabricator.wikimedia.org/T110139) [23:54:19] (03CR) 10Dzahn: [C: 04-1] switch lists IP from sodium to fermium [dns] - 10https://gerrit.wikimedia.org/r/233642 (https://phabricator.wikimedia.org/T110139) (owner: 10Dzahn) [23:55:29] 6operations, 10Wikimedia-Mailing-lists: send follow-up email, announce changes with new mailman version if any that have user impact - https://phabricator.wikimedia.org/T110140#1569679 (10Dzahn) 3NEW a:3Dzahn [23:55:44] 6operations, 10Wikimedia-Mailing-lists: TTL back up to normal 1H - https://phabricator.wikimedia.org/T110141#1569686 (10Dzahn) 3NEW a:3Dzahn [23:56:11] 6operations, 10Wikimedia-Mailing-lists: shutdown sodium, decom - https://phabricator.wikimedia.org/T110142#1569693 (10Dzahn) 3NEW a:3Dzahn