[00:00:05] (CR) jenkins-bot: [V: -1] Add service deploy via scap [tools/scap] - https://gerrit.wikimedia.org/r/224374 (owner: Thcipriani)
[00:06:54] (PS2) Thcipriani: Add service deploy via scap [tools/scap] - https://gerrit.wikimedia.org/r/224374
[00:09:17] (CR) BryanDavis: Add service deploy via scap (10 comments) [tools/scap] - https://gerrit.wikimedia.org/r/224374 (owner: Thcipriani)
[00:42:18] operations, ops-eqiad: db1050 raid degraded - https://phabricator.wikimedia.org/T103110#1448586 (Springle) Sounds good.
[00:43:07] operations, ops-eqiad: db1050 raid degraded - https://phabricator.wikimedia.org/T103110#1448587 (Springle) Also, yes, time to plan another batch.
[01:04:07] springle: In case you didn't notice: labsdb1003 replication broke
[01:06:37] * 1002
[01:21:07] PROBLEM - puppet last run on cp3006 is CRITICAL puppet fail
[01:47:01] !log restarted labsdb1002 mysqld while troubleshooting replication
[01:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:49:28] RECOVERY - puppet last run on cp3006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:53:33] (PS1) Springle: add missing mod_rewrite rules [software/tendril] - https://gerrit.wikimedia.org/r/224378
[01:55:41] (CR) Springle: "What is the new puppet deployment process? Will this be auto-deployed after a merge?" [software/tendril] - https://gerrit.wikimedia.org/r/224378 (owner: Springle)
[02:10:17] !log l10nupdate Synchronized php-1.26wmf13/cache/l10n: (no message) (duration: 00m 34s)
[02:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:10:26] !log LocalisationUpdate completed (1.26wmf13) at 2015-07-13 02:10:25+00:00
[02:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:10:36] (CR) Ori.livneh: "tendril is deployed via git::clone with ensure => latest, which will update /srv/tendril to origin/master on the Puppet run following the " [software/tendril] - https://gerrit.wikimedia.org/r/224378 (owner: Springle)
[02:14:05] (CR) Ori.livneh: "This probably belongs in operations/puppet:modules/tendril/templates/apache/tendril.wikimedia.org.erb rather than here." [software/tendril] - https://gerrit.wikimedia.org/r/224378 (owner: Springle)
[02:20:27] !log l10nupdate Synchronized php-1.26wmf13/cache/l10n: (no message) (duration: 06m 16s)
[02:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:23:43] !log LocalisationUpdate completed (1.26wmf13) at 2015-07-13 02:23:43+00:00
[02:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:25:58] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Jul 13 02:25:58 UTC 2015 (duration 25m 57s)
[02:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:46:28] if ( $wmgUseAPIRequestLog ) {
[02:46:29] $wgAPIRequestLog = "udp://locke.wikimedia.org:9000/$wgDBname";
[02:46:29] }
[02:46:31] that's always false
[02:46:36] but locke was in pmtpa :/
[03:29:02] (PS3) Thcipriani: Add service deploy via scap [tools/scap] - https://gerrit.wikimedia.org/r/224374
[04:16:57] Someone, probably you, from IP address 81.106.12.60,
[04:17:00] has registered an account "Odder" with this email address on Wikimedia Commons.
[04:18:27] I think the e-mail comes a few years late...
[04:54:21] Krenair: Why the $wmgUseAPIRequestLog mention? Still sifting through the conf files looking for clutter?
[05:07:32] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Jul 13 05:07:32 UTC 2015 (duration 7m 31s)
[05:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:12:11] operations, Wikimedia-Git-or-Gerrit, Patch-For-Review: TransparencyReport repository master in Gerrit silently made private - https://phabricator.wikimedia.org/T89640#1448641 (MSyed) They Legal team needs this to be made public in the next few days to launch the new version of Transparency Report. @Prt...
[06:17:33] operations, Wikimedia-Git-or-Gerrit, Patch-For-Review: TransparencyReport repository master in Gerrit silently made private - https://phabricator.wikimedia.org/T89640#1448643 (Prtksxna) Specifically, we need is to force push the private repository to the public one once we are ready, so that the websit...
[06:24:58] !log Experimenting with altering the localisation cache implementation for testwiki, operations/mediawiki-config on tin will have a local hack for a little bit
[06:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:25:43] operations, Discovery, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Define the details of the hardware we need to run WDQS - https://phabricator.wikimedia.org/T104879#1448644 (Smalyshev) Blazegraph recommendations: https://wiki.blazegraph.com/wiki/index.php/Hardware...
[06:25:49] !log LocalisationUpdate failed: git pull of core failed
[06:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:31:07] PROBLEM - puppet last run on db1028 is CRITICAL Puppet has 1 failures
[06:31:09] PROBLEM - puppet last run on wtp2008 is CRITICAL Puppet has 1 failures
[06:31:27] PROBLEM - puppet last run on subra is CRITICAL Puppet has 1 failures
[06:31:27] PROBLEM - puppet last run on mw2043 is CRITICAL Puppet has 1 failures
[06:31:37] PROBLEM - puppet last run on cp3048 is CRITICAL Puppet has 1 failures
[06:31:58] PROBLEM - HHVM rendering on mw1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50426 bytes in 0.034 second response time
[06:31:59] PROBLEM - puppet last run on db1067 is CRITICAL Puppet has 1 failures
[06:32:08] PROBLEM - puppet last run on mw2158 is CRITICAL Puppet has 1 failures
[06:32:28] PROBLEM - Apache HTTP on mw1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50426 bytes in 0.038 second response time
[06:32:38] PROBLEM - puppet last run on mw1135 is CRITICAL Puppet has 1 failures
[06:32:58] ^ori?
[06:33:18] PROBLEM - puppet last run on mw2073 is CRITICAL Puppet has 1 failures
[06:33:44] the puppet run failures occur every day
[06:33:54] it is apparently not worth fixing
[06:33:56] HHVM rendering
[06:34:28] yeah, mw1017 is me; that server is testwiki only
[06:34:35] ok
[06:35:05] (CR) Muehlenhoff: [C: 1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/223886 (https://phabricator.wikimedia.org/T104943) (owner: Dzahn)
[06:35:38] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 68895 bytes in 0.150 second response time
[06:36:07] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.041 second response time
[06:37:49] operations, Discovery, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Define the details of the hardware we need to run WDQS - https://phabricator.wikimedia.org/T104879#1448652 (Smalyshev) Based on the above, I think what we need is: * 64G memory * 300 G SSD * 4-8 co...
[06:39:48] operations, Discovery, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Wikidata Query Service hardware - https://phabricator.wikimedia.org/T86561#1448655 (Smalyshev) Based on T104879, I think what we need is: * 64G memory * 300 G SSD * 4-8 cores with 2.5 GHz min For...
[06:53:47] RECOVERY - puppet last run on db1028 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:55:48] RECOVERY - puppet last run on wtp2008 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:55:58] RECOVERY - puppet last run on subra is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:55:58] RECOVERY - puppet last run on mw2043 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:56:08] RECOVERY - puppet last run on cp3048 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:56:37] RECOVERY - puppet last run on db1067 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:47] RECOVERY - puppet last run on mw2158 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:57:18] RECOVERY - puppet last run on mw1135 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:57:58] RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:19:29] (CR) Muehlenhoff: [C: 1] "Looks good to me. But this probably needs to be scheduled/announced, since enabling the ferm rules set will cut existing IRC connections." [puppet] - https://gerrit.wikimedia.org/r/223887 (https://phabricator.wikimedia.org/T104943) (owner: Dzahn)
[07:27:15] (CR) Muehlenhoff: [C: 1] "Looks good to me. Migrating this into the nodepool Debian package at a later point should be straight-forward: You nedd a "After=network.t" [puppet] - https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) (owner: Hashar)
[07:29:13] !log ori Synchronized php-1.26wmf13/includes/cache/LCStoreStaticArray.php: I3f63594a4: Fix variable name (follows Ib2c5856d) (duration: 00m 11s)
[07:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:16:25] (CR) Muehlenhoff: [C: 1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/223849 (https://phabricator.wikimedia.org/T104996) (owner: Dzahn)
[08:21:09] operations, Multimedia, Performance-Team, Wikimedia-Site-requests: Please offer larger image thumbnail sizes in Special:Preferences - https://phabricator.wikimedia.org/T65440#1448816 (fgiunchedi) in terms of numbers on the swift side these are top50 sizes using the most space (data is from last augu...
[08:27:59] (PS1) Hashar: nodepool: fix authentication URL [puppet] - https://gerrit.wikimedia.org/r/224386
[08:28:48] (PS2) Hashar: nodepool: fix authentication URL [puppet] - https://gerrit.wikimedia.org/r/224386
[08:43:38] PROBLEM - DPKG on labnodepool1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[08:44:00] (CR) Muehlenhoff: "@resolve is a ferm function, i.e. the hostnames are resolved during ferm startup only, not by iptables itself. Since we only use DNS serve" [puppet] - https://gerrit.wikimedia.org/r/223537 (owner: Muehlenhoff)
[08:50:28] !log upgrade graphite to 0.9.13 on graphite1001 and bounce one instance of carbon/cache
[08:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:51:43] !log bounce carbon daemons on graphite1001
[08:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:52:42] !log bounce graphite-web on graphite1001
[08:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:07:40] (CR) Hashar: "@mmoritz should I incorporate those changes in the puppet version right now? I am afraid to forget about them when I port them to the .deb" [puppet] - https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) (owner: Hashar)
[09:13:58] operations, Graphite: Upgrade Graphite from 0.9.12 to 0.9.13 - https://phabricator.wikimedia.org/T104536#1448881 (fgiunchedi) Open>Resolved a:fgiunchedi this is complete, I've built and uploaded the relevant graphite packages to our `trusty-wikimedia` repo and upgraded `graphite1001` (the standby...
[09:17:00] ori: ^
[09:17:58] godog: nice, thank you. the 6 month view on https://grafana.wikimedia.org/#/dashboard/db/performance-metrics doesn't freeze my browser, so i guess that is fixed too
[09:20:25] indeed, graphite DTRT now at least on that part
[09:21:04] any other goodies? /me reads changelog
[09:21:21] I see adminbot now doesn't collapse entries per-day anymore? https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:23] (CR) Muehlenhoff: "If you incorporate them into the puppetry, then you would also need to update the nodepool package along with it; if you install the servi" [puppet] - https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) (owner: Hashar)
[09:22:04] ooh, a whole bunch
[09:22:19] godog: seems like a bug
[09:25:06] indeedly, https://phabricator.wikimedia.org/T105678
[09:25:34] ( http://graphite.readthedocs.org/en/latest/releases/0_9_13.html btw, though you probably saw it )
[09:27:04] yeah bits and pieces, nothing that would point to hideous breakage
[09:27:55] heh i was a bit alarmed by: "Refactor json responses for clarity (whilp) "
[09:37:09] (CR) Hashar: [C: 2 V: 2] "I got the package regenerated with this change included." [debs/nodepool] (debian) - https://gerrit.wikimedia.org/r/216660 (owner: Hashar)
[09:42:33] (PS1) Hashar: (WIP) systemd support (WIP) [debs/nodepool] (debian) - https://gerrit.wikimedia.org/r/224390 (https://phabricator.wikimedia.org/T96867)
[09:43:24] (CR) Hashar: "Great! I proposed a draft change against the nodepool debian directory at https://gerrit.wikimedia.org/r/224390 to make sure I do not forg" [puppet] - https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) (owner: Hashar)
[09:49:31] (CR) Filippo Giunchedi: [C: -1] Add ferm rules for swift backends (1 comment) [puppet] - https://gerrit.wikimedia.org/r/224071 (https://phabricator.wikimedia.org/T104965) (owner: Muehlenhoff)
[09:52:19] (PS5) Filippo Giunchedi: jobchron: log rotate [puppet] - https://gerrit.wikimedia.org/r/218905 (owner: Matanya)
[09:52:28] (CR) Filippo Giunchedi: [C: 2 V: 2] jobchron: log rotate [puppet] - https://gerrit.wikimedia.org/r/218905 (owner: Matanya)
[09:53:13] Warning: Unit file changed on disk, 'systemctl daemon-reload' recommended.
[09:53:20] I am going to like systemd :-}
[09:59:42] (CR) Hashar: [C: -1] "Need to be refined. For example nodepool @INFO is logged twice in nodepool.log" [puppet] - https://gerrit.wikimedia.org/r/224106 (owner: Hashar)
[09:59:57] (PS2) Hashar: nodepool: fix typo pruge -> purge [puppet] - https://gerrit.wikimedia.org/r/224177
[10:01:21] operations, MediaWiki-JobRunner, Patch-For-Review: jobchron logs are not rotated - https://phabricator.wikimedia.org/T96132#1449006 (fgiunchedi) applied now on mw1001, seems to work: ``` mw1001:~$ ls -la /var/log/upstart/jobchron.log /var/log/mediawiki/jobchron.log -rw-r--r-- 1 root root 510083780 Ju...
[10:05:11] moritzm: guten Tag :-) Have you got a chance to rebuild the jenkins-debian-glue package for our jessie-wikimedia ? :-D
[10:06:51] hashar: not today, I can look into it tomorrow
[10:08:41] moritzm: I am observing France holiday tomorrow. But we can sync up on wednesday if you are around :}
[10:11:40] hashar: let's do that
[10:22:37] !log ori Synchronized php-1.26wmf13/maintenance/rebuildLocalisationCache.php: 117f60a171: rebuildLocalisationCache: don't limit memory usage (duration: 00m 12s)
[10:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:31:33] hashar: you know is there is a possibelity to mass restore pages?
[10:31:58] or matanya :)^^
[10:35:57] <_joe_> hashar: yeah I think the whole western world should observe a national holiday tomorrow
[10:37:58] bastille day!!!!!
[10:38:01] Steinsplitter: no clue
[10:39:37] _joe_: would you mind merging a puppet typo fix please ? :-) https://gerrit.wikimedia.org/r/#/c/224177/
[10:39:47] pruge -> purge
[10:39:47] :-(
[10:40:08] (CR) Giuseppe Lavagetto: [C: 2] nodepool: fix typo pruge -> purge [puppet] - https://gerrit.wikimedia.org/r/224177 (owner: Hashar)
[10:42:35] grazie mille
[10:43:51] (PS1) Gilles: Re-enable thumbnail chaining with a single reference thumbnail [mediawiki-config] - https://gerrit.wikimedia.org/r/224393 (https://phabricator.wikimedia.org/T105680)
[10:44:23] (CR) Gilles: [C: -1] "Don't deploy this for now, we need instrumentation first: T105681" [mediawiki-config] - https://gerrit.wikimedia.org/r/224393 (https://phabricator.wikimedia.org/T105680) (owner: Gilles)
[10:45:01] Parameter source failed on File[/var/lib/nodepool/.ssh/dib_jenkins_id_rsa]: Cannot use relative URLs '-----BEGIN RSA PRIVATE KEY-----
[10:45:01] ....
[10:45:03] seriously puppet
[10:45:42] I kept changing things back and forth :-(
[10:51:01] (PS1) Hashar: nodepool: SSH private key is a payload, not a resource [puppet] - https://gerrit.wikimedia.org/r/224396
[10:58:29] (CR) Hashar: [C: 1 V: 2] "And I actually tested this one on labs :-}" [puppet] - https://gerrit.wikimedia.org/r/224396 (owner: Hashar)
[10:58:49] !log restbase deploying 6dec79d
[10:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:25:09] (PS9) Matanya: monitoring: detect saturation of nf_conntrack table [puppet] - https://gerrit.wikimedia.org/r/223560
[11:32:04] operations, Graphite, Monitoring: evaluate tessera dashboards - https://phabricator.wikimedia.org/T104366#1449123 (fgiunchedi) >>! In T104366#1414782, @ori wrote: >>>! In T104366#1414737, @faidon wrote: >> Copying from IRC: >> - Is this being pitched as a Grafana replacement or something that will run i...
[12:19:26] operations, RESTBase, Services, RESTBase-API: Expose RESTBase monitoring examples in Swagger spec - https://phabricator.wikimedia.org/T104850#1449151 (Pchelolo) The fix has been deployed, closing.
[12:19:32] operations, Services, Patch-For-Review, Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#1449153 (Pchelolo)
[12:19:36] operations, RESTBase, Services, RESTBase-API: Expose RESTBase monitoring examples in Swagger spec - https://phabricator.wikimedia.org/T104850#1449152 (Pchelolo) Open>Resolved
[12:22:11] (CR) Krinkle: "Please deploy. https://wikitech.wikimedia.org/wiki/Server_Admin_Log is still unusable." [debs/adminbot] - https://gerrit.wikimedia.org/r/224212 (owner: BryanDavis)
[12:52:19] (PS1) Glaisher: Set $wmgUseFooterContactLink to true at Ukrainian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/224405 (https://phabricator.wikimedia.org/T104924)
[13:00:33] I keep hitting 503's on the beta cluster
[13:00:47] Request: POST http://es.wikipedia.beta.wmflabs.org/w/index.php?title=Sensible_North_Carolina&action=delete, from 127.0.0.1 via deployment-cache-text02 deployment-cache-text02 ([127.0.0.1]:3128), Varnish XID 913670840
[13:00:47] Forwarded for: 80.217.41.134, 127.0.0.1
[13:00:47] Error: 503, Service Unavailable at Mon, 13 Jul 2015 13:00:02 GMT
[13:01:15] wfm
[13:02:03] Glaisher: Yeah, but once every 10 delete I try, I hit the 503-page
[13:02:26] http://es.wikipedia.beta.wmflabs.org/wiki/Especial:CambiosRecientes o_O
[13:02:41] Show bots...
[13:02:49] !log krenair Synchronized php-1.26wmf13/extensions/Cite/extension.json: https://gerrit.wikimedia.org/r/#/c/224407/ - unbreak VE mobile, https://phabricator.wikimedia.org/T105686 (duration: 00m 12s)
[13:02:53] I had to turn on a bot flag not to flood to much
[13:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:02:58] Are you deleting each page manually?
[13:03:34] Glaisher: not really..I have a script which mankes me click each article title once and it will do it and add the comment I have sepecified earlier.
[13:03:37] makes*
[13:03:53] The Nuke (massdelete) only works for the last 30(?) days.
[13:04:50] Josve05a: how often do you hit them?
[13:05:00] Yeah, it uses the recentchanges table, iirc
[13:05:56] JohnFLewis: ... often ... 10-20 / min
[13:06:06] Josve05a: mkay
[13:07:20] I also keep getting logged out (in one tab), which caused the delete to fail due to "Error de permisos" and then logged back in again due to CentralAuth
[13:07:49] !log updating openssl on cp*
[13:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:13:27] However, these are more funnier to do, than to delte 1000 pages semi-manaully... http://deployment.wikimedia.beta.wmflabs.org/wiki/Special:GlobalBlockList
[13:13:31] !log updating nginx/bind on cp*
[13:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:21:50] (PS1) Glaisher: Add import sources at gomwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/224408 (https://phabricator.wikimedia.org/T104563)
[13:27:18] operations, Graphite, Monitoring: deprecate gdash - https://phabricator.wikimedia.org/T104365#1449234 (fgiunchedi) I took a look at exporting dashboards and graphs from gdash, each `.graph` file is a serialized ruby object (from `graphite_graph`) that can be converted to yaml with sth like this: ``` re...
[13:29:38] (CR) Andrew Bogott: [C: 2] nodepool: SSH private key is a payload, not a resource [puppet] - https://gerrit.wikimedia.org/r/224396 (owner: Hashar)
[13:35:28] RECOVERY - DPKG on labnodepool1001 is OK: All packages OK
[13:35:43] andrewbogott_afk: puppet is finally compiling the catalog on labndepool :-}
[13:37:17] RECOVERY - puppet last run on labnodepool1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:38:00] (CR) John F. Lewis: [C: -1] "not complete + syntax errors in the apache erb" [puppet] - https://gerrit.wikimedia.org/r/224210 (owner: John F. Lewis)
[13:39:40] (PS3) Hashar: nodepool: fix authentication URL [puppet] - https://gerrit.wikimedia.org/r/224386
[13:48:01] (CR) Hashar: "Else I have:" [puppet] - https://gerrit.wikimedia.org/r/224386 (owner: Hashar)
[13:49:18] Blocked-on-Operations, operations, Continuous-Integration-Infrastructure: Update jenkins-debian-glue packages on Jessie to v0.13.0 - https://phabricator.wikimedia.org/T102106#1449275 (Joe) p:Triage>Normal
[13:49:44] operations, Graphite, Monitoring: evaluate tessera dashboards - https://phabricator.wikimedia.org/T104366#1449277 (BBlack) My $0.02: Grafana: - I've seen links from others with interesting dashboards that were useful and they worked well - Example: [[ http://grafana.wikimedia.org/#/dashboard/db/cassand...
[13:50:09] Blocked-on-Operations, operations, Continuous-Integration-Infrastructure: Update jenkins-debian-glue packages on Jessie to v0.13.0 - https://phabricator.wikimedia.org/T102106#1449278 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[13:51:51] operations, RESTBase: Update JDK 8 package in backports repo - https://phabricator.wikimedia.org/T104887#1449280 (Joe) Open>declined
[13:52:08] operations, RESTBase: Update JDK 8 package in backports repo - https://phabricator.wikimedia.org/T104887#1431047 (Joe) We don't really need this, do we?
[13:52:16] (CR) Filippo Giunchedi: "misc nitpicks but overall looks good" (6 comments) [puppet] - https://gerrit.wikimedia.org/r/223560 (owner: Matanya)
[13:54:11] (CR) Filippo Giunchedi: "ping?" [puppet] - https://gerrit.wikimedia.org/r/222205 (https://phabricator.wikimedia.org/T101799) (owner: Filippo Giunchedi)
[13:54:54] (PS1) BBlack: 2layer: doc current weighting for eqiad-upload [puppet] - https://gerrit.wikimedia.org/r/224412
[13:54:57] (PS1) BBlack: Switch on ECDSA Unified for misc-web cluster [puppet] - https://gerrit.wikimedia.org/r/224413
[13:55:15] (CR) BBlack: [C: 2 V: 2] 2layer: doc current weighting for eqiad-upload [puppet] - https://gerrit.wikimedia.org/r/224412 (owner: BBlack)
[13:56:16] operations, Discovery, Elasticsearch: unattended elasticsearch restarts - https://phabricator.wikimedia.org/T89845#1449284 (fgiunchedi)
[13:56:33] operations, Discovery, Elasticsearch: unattended elasticsearch restarts - https://phabricator.wikimedia.org/T89845#1449293 (fgiunchedi) after upgrade to 1.6 this should also take way less time
[13:56:39] (CR) BBlack: [C: 2] Switch on ECDSA Unified for misc-web cluster [puppet] - https://gerrit.wikimedia.org/r/224413 (owner: BBlack)
[13:56:42] operations, RESTBase: Update JDK 8 package in backports repo - https://phabricator.wikimedia.org/T104887#1449295 (mobrovac) declined>Open >>! In T104887#1449280, @Joe wrote: > We don't really need this, do we? I think we do. We are currently running on OpenJDK8 and it seemed to provide more stabili...
[13:59:16] (PS2) John F. Lewis: mail: hiera-ise mailman and lists [puppet] - https://gerrit.wikimedia.org/r/224210
[14:01:22] operations, RESTBase: Update JDK 8 package in backports repo - https://phabricator.wikimedia.org/T104887#1449309 (MoritzMuehlenhoff) Do we have real evidence OpenJDK 8 makes a measurable difference to Casssandra? A properly OpenJDK 8 would require additional effort/maintenance on our side, while we can st...
[14:01:35] _joe_: could you give the patch about a look through if you have spare time?
[14:02:07] <_joe_> JohnFLewis: sure, lemme finish my script-from-hell and I'll do
[14:02:21] okay
[14:05:43] operations, RESTBase: Update JDK 8 package in backports repo - https://phabricator.wikimedia.org/T104887#1449322 (fgiunchedi) see also related discussion in {T104888}
[14:07:31] operations, RESTBase: Update JDK 8 package in backports repo - https://phabricator.wikimedia.org/T104887#1449334 (mobrovac) For a period of time during the Cassandra semi-outages, we switched half of the nodes to OpenJDK8 and they appeared to be more stable than the others (in spite of the fact that the b...
[14:11:07] (PS1) BBlack: remove OCSP for planet.wm.o cert (borked) [puppet] - https://gerrit.wikimedia.org/r/224415
[14:11:11] operations, Traffic: Clean up DNS/redirects for TLS - https://phabricator.wikimedia.org/T102824#1449346 (Joe) p:Triage>Normal
[14:12:10] operations, ops-codfw: mc2001 not coming up after reboot - https://phabricator.wikimedia.org/T102222#1449352 (Joe) p:Triage>Normal
[14:13:45] (CR) BBlack: [C: 2] remove OCSP for planet.wm.o cert (borked) [puppet] - https://gerrit.wikimedia.org/r/224415 (owner: BBlack)
[14:13:49] operations, ops-codfw: mc2001 not coming up after reboot - https://phabricator.wikimedia.org/T102222#1449360 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[14:23:51] (PS4) Andrew Bogott: nodepool: fix authentication URL [puppet] - https://gerrit.wikimedia.org/r/224386 (owner: Hashar)
[14:25:26] (CR) Andrew Bogott: [C: 2] nodepool: fix authentication URL [puppet] - https://gerrit.wikimedia.org/r/224386 (owner: Hashar)
[14:26:58] (CR) Hashar: "That worked :-) Thank you!" [puppet] - https://gerrit.wikimedia.org/r/224386 (owner: Hashar)
[14:27:00] (CR) Manybubbles: [C: 1] "Its a good thing to have. Might be better to make es-tool support what we want though." [puppet] - https://gerrit.wikimedia.org/r/223974 (owner: EBernhardson)
[14:27:27] (CR) Manybubbles: [C: 2 V: 2] "Time to deploy." [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/221136 (https://phabricator.wikimedia.org/T103598) (owner: DCausse)
[14:29:45] operations, RESTBase: Test JDK8 with Cassandra - https://phabricator.wikimedia.org/T104888#1449394 (GWicke) See also: https://issues.apache.org/jira/browse/CASSANDRA-7486
[14:30:30] !log starting the elasticsearch 1.6.0 upgrade
[14:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:30:49] !log es1.6 step 0: sync new versions of plugins
[14:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:31:52] (PS2) Andrew Bogott: role::puppet::server::labs Remove unused configuration [puppet] - https://gerrit.wikimedia.org/r/214637 (owner: Alexandros Kosiaris)
[14:33:20] (PS3) Andrew Bogott: role::puppet::server::labs Remove unused configuration [puppet] - https://gerrit.wikimedia.org/r/214637 (owner: Alexandros Kosiaris)
[14:34:36] (CR) Andrew Bogott: [C: 2] role::puppet::server::labs Remove unused configuration [puppet] - https://gerrit.wikimedia.org/r/214637 (owner: Alexandros Kosiaris)
[14:37:42] (PS1) BBlack: Switch on ECDSA Unified for primaries + parsoid [puppet] - https://gerrit.wikimedia.org/r/224419
[14:39:29] (PS1) Alex Monk: Remove SVN admin and coder groups from mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/224420 (https://phabricator.wikimedia.org/T105676)
[14:40:26] (CR) BBlack: [C: 2 V: 2] Switch on ECDSA Unified for primaries + parsoid [puppet] - https://gerrit.wikimedia.org/r/224419 (owner: BBlack)
[14:45:33] !log es1.6 step 0: successfully synced new versions of plugins
[14:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:45:54] !log es1.6 step 1: upgrade elasticsearch on elastic1001 -starting
[14:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:46:59] (CR) John F. Lewis: [C: 1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/224205 (https://phabricator.wikimedia.org/T98816) (owner: Dzahn)
[14:54:58] jouncebot, next
[14:54:58] In 0 hour(s) and 5 minute(s): Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150713T1500)
[14:55:28] !log after upgrading elasticsearch its init script no longer shuts down the old version of elasticsearch. so you have to manually kill it. that means the upgrade instructions will be "special" this time around. hopefully this is a one time thing.
[14:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:56:57] Just realised I have to go AFK for a bit, will the person doing swat please include https://gerrit.wikimedia.org/r/#/c/224405/ ? It should be trivial to verify etc.
[14:58:27] PROBLEM - puppet last run on ms-be1003 is CRITICAL Puppet has 1 failures
[15:00:05] manybubbles anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150713T1500). Please do the needful.
[15:06:02] looks like I'll do swat then
[15:06:08] It's empty
[15:06:10] Just looked
[15:06:13] Krenair: can you add that patch to the deploy?
[15:06:21] ostriches: Krenair proposed one [15:06:24] Ah, there is one, got it [15:06:28] mmk, nbd [15:07:25] (03CR) 10Manybubbles: [C: 032] Set $wmgUseFooterContactLink to true at Ukrainian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224405 (https://phabricator.wikimedia.org/T104924) (owner: 10Glaisher) [15:07:32] (03Merged) 10jenkins-bot: Set $wmgUseFooterContactLink to true at Ukrainian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224405 (https://phabricator.wikimedia.org/T104924) (owner: 10Glaisher) [15:09:25] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: SWAT enable footer contact link on ukwiki (duration: 00m 11s) [15:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:02] Krenair: all done and verified. works. thanks [15:11:15] !log all done SWATing. [15:11:18] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [15:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:57] (03PS2) 10Andrew Bogott: role::puppet::server::labs clean up allow_from [puppet] - 10https://gerrit.wikimedia.org/r/214638 (owner: 10Alexandros Kosiaris) [15:14:58] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61436 bytes in 1.751 second response time [15:16:30] (03CR) 10Andrew Bogott: [C: 032] role::puppet::server::labs clean up allow_from [puppet] - 10https://gerrit.wikimedia.org/r/214638 (owner: 10Alexandros Kosiaris) [15:23:27] (03Abandoned) 10Andrew Bogott: lint: fully qualify puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/214639 (owner: 10Alexandros Kosiaris) [15:23:29] (03PS5) 10BBlack: Rank all ECDHE > all DHE for "mid" level suites [puppet] - 10https://gerrit.wikimedia.org/r/224232 (https://phabricator.wikimedia.org/T105455) (owner: 10Chmarkine) [15:24:00] (03CR) 10BBlack: [C: 032 V: 032] Rank all ECDHE > all DHE for "mid" level suites [puppet] - 
https://gerrit.wikimedia.org/r/224232 (https://phabricator.wikimedia.org/T105455) (owner: Chmarkine)
[15:24:59] RECOVERY - puppet last run on ms-be1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:33:11] bblack: around?
[15:35:20] (PS2) Andrew Bogott: Move certmanager hostname configuration to hiera [puppet] - https://gerrit.wikimedia.org/r/214640 (owner: Alexandros Kosiaris)
[15:39:08] (PS3) Andrew Bogott: Move certmanager hostname configuration to hiera [puppet] - https://gerrit.wikimedia.org/r/214640 (owner: Alexandros Kosiaris)
[15:41:14] (CR) Andrew Bogott: [C: 2] Move certmanager hostname configuration to hiera [puppet] - https://gerrit.wikimedia.org/r/214640 (owner: Alexandros Kosiaris)
[15:49:23] (PS2) Andrew Bogott: Rename role::puppet::server::labs [puppet] - https://gerrit.wikimedia.org/r/214641 (owner: Alexandros Kosiaris)
[15:50:22] (CR) Andrew Bogott: [C: -2] "I don't think this is worth the disruption it will cause." [puppet] - https://gerrit.wikimedia.org/r/214642 (owner: Alexandros Kosiaris)
[15:50:37] (CR) Andrew Bogott: [C: 2] Rename role::puppet::server::labs [puppet] - https://gerrit.wikimedia.org/r/214641 (owner: Alexandros Kosiaris)
[15:51:33] (CR) Yuvipanda: "https://tools.wmflabs.org/watroles/role/role::puppet::self it *is* doable - maybe a day of work for someone to hand fix those instances? d" [puppet] - https://gerrit.wikimedia.org/r/214642 (owner: Alexandros Kosiaris)
[15:53:07] PROBLEM - Apache HTTP on mw1157 is CRITICAL - Socket timeout after 10 seconds
[15:54:07] Krenair: would you mind reviewing https://gerrit.wikimedia.org/r/#/c/224087 ?
[15:54:48] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.089 second response time
[15:58:47] (CR) Yuvipanda: [C: -1] Labs: Script to back labstore filesystems up (8 comments) [puppet] - https://gerrit.wikimedia.org/r/224064 (https://phabricator.wikimedia.org/T105027) (owner: coren)
[16:00:02] (PS2) Yuvipanda: ldap: Allow projects to override user's loginshells [puppet] - https://gerrit.wikimedia.org/r/223828 (https://phabricator.wikimedia.org/T102395)
[16:00:18] (CR) Yuvipanda: [C: 2 V: 2] ldap: Allow projects to override user's loginshells [puppet] - https://gerrit.wikimedia.org/r/223828 (https://phabricator.wikimedia.org/T102395) (owner: Yuvipanda)
[16:01:01] andrewbogott, seems fine to me from a mediawiki perspective, assuming not using nutcracker is OK from an ops point of view
[16:01:13] (CR) Alex Monk: " andrewbogott, seems fine to me from a mediawiki perspective, assuming not using nutcracker is OK from an ops point of view" [mediawiki-config] - https://gerrit.wikimedia.org/r/224087 (https://phabricator.wikimedia.org/T102993) (owner: Andrew Bogott)
[16:01:29] Krenair: thanks. For the most part we’ve agreed that it’s needless complexity.
[16:01:36] and assuming that it's actually running on that port
[16:01:44] but I think you checked that already?
[16:02:08] yep, I’ve live-hacked the change.
[16:02:19] Hm, I just missed swat didn’t I?
[16:04:56] andrewbogott, yeah, but I'll sync it anyway as this only affects wikitech
[16:05:06] thanks!
[16:05:25] andrewbogott: hi.. can you help me with an old instance on labs?
[16:05:33] hoo: I can try. name/project?
[16:05:51] wikidata-suggester.wikidata-dev.eqiad.wmflabs
[16:05:56] https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikidata-dev
[16:06:10] (PS3) Alex Monk: Don't use nutcracker on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/224087 (https://phabricator.wikimedia.org/T102993) (owner: Andrew Bogott)
[16:06:20] It's not accepting my ssh keys... I guess either home is missing on it, or the permission on my authorized keys are screwed
[16:06:27] Although that would be very surprising
[16:06:29] (CR) Alex Monk: [C: 2] "as agreed with andrew, syncing this" [mediawiki-config] - https://gerrit.wikimedia.org/r/224087 (https://phabricator.wikimedia.org/T102993) (owner: Andrew Bogott)
[16:06:35] (Merged) jenkins-bot: Don't use nutcracker on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/224087 (https://phabricator.wikimedia.org/T102993) (owner: Andrew Bogott)
[16:06:38] already rebooted it, assuming that might be on a hung nts
[16:06:43] * nfs
[16:07:14] (PS1) Chad: es-tool: Restart ganglia after restarting Elasticsearch [puppet] - https://gerrit.wikimedia.org/r/224435
[16:07:42] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/224087/ (duration: 00m 12s)
[16:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:08:29] (PS2) Chad: es-tool: Restart ganglia after restarting Elasticsearch [puppet] - https://gerrit.wikimedia.org/r/224435
[16:08:46] !log krenair Synchronized wmf-config: https://gerrit.wikimedia.org/r/#/c/224087/ (duration: 00m 12s)
[16:08:48] hoo: I’m not sure quite what’s broken, but it’s something dramatic. My root key works, I’ll see what I can do.
[16:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:09:59] andrewbogott, done
[16:10:06] great, wikitech still looks fine to me.
[16:10:10] Thank you!
[16:10:41] (PS3) Andrew Bogott: beta: remove deployment-logstash1 [puppet] - https://gerrit.wikimedia.org/r/224219 (owner: BryanDavis)
[16:11:28] (CR) Andrew Bogott: [C: 2] beta: remove deployment-logstash1 [puppet] - https://gerrit.wikimedia.org/r/224219 (owner: BryanDavis)
[16:11:57] hoo: can you log in now?
[16:12:17] andrewbogott: we should get rid of nutcracker from puppet too I presume
[16:12:27] YuviPanda: yeah, will look
[16:13:02] YuviPanda: actually, that’s probably using puppet code that’s shared with other prod hosts right?
[16:13:07] andrewbogott: no :(
[16:13:16] not sure...
[16:13:20] if so we can hiera it
[16:13:30] hoo: so I see
[16:13:34] YuviPanda: or not care :)
[16:13:38] pffft :P
[16:15:38] hoo: can you log in to /any/ of the instances in that project?
[16:16:00] because I can’t, even though I’m a member
[16:16:30] andrewbogott: Yes
[16:16:38] hoo: for example?
[16:16:38] wikidata-mobile.wikidata-dev.eqiad.wmflabs worked for me
[16:16:44] it's the newest, I guess
[16:17:07] ok, yes, I can reach that one as well.
[16:17:22] oh, this is precise...
[16:17:27] I’m going to reboot -suggester again, ok?
[16:17:45] hoo: ^ ?
[16:18:34] sure
[16:19:19] (CR) Manybubbles: [C: 1] es-tool: Restart ganglia after restarting Elasticsearch [puppet] - https://gerrit.wikimedia.org/r/224435 (owner: Chad)
[16:21:15] (CR) EBernhardson: "the difficulty with es-tool is that we need to perform actions on a mediawiki server(e.g. terbium or deployment-bastion), as well as doing" [puppet] - https://gerrit.wikimedia.org/r/223974 (owner: EBernhardson)
[16:21:35] hoo: apt is really a mess on this box (and, I presume, the others) - mysql dependencies are broken which prevents the needed upgrade of openssh. I may be able to fix it, still poking.
[16:22:12] I wonder why that box has mysql in the first place
[16:23:58] operations, Discovery: Cirrus search in codfw - https://phabricator.wikimedia.org/T105703#1449703 (Joe) NEW
[16:25:29] (CR) EBernhardson: "the elasticsearch instances in beta cluster don't have ganglia installed, so this needs some sort of conditional" [puppet] - https://gerrit.wikimedia.org/r/224435 (owner: Chad)
[16:25:59] (CR) Yuvipanda: "Use hiera('has_ganglia', true) - it's set to false in labs." [puppet] - https://gerrit.wikimedia.org/r/224435 (owner: Chad)
[16:26:05] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1449713 (BBlack) So, where are we at on removing the redirection workarounds here? I'd still like to get these removed ASA...
[16:26:10] operations, Browser-Support-Internet-Explorer, HTTPS, HTTPS-by-default: Xbox 360 Internet Explorer unable to view Wikipedia - https://phabricator.wikimedia.org/T105455#1449714 (brion) Ok, MS contacts are telling me they are in fact working on an increase in bit length for the DHE which should resolve...
[16:28:19] hoo: try now?
[16:29:01] andrewbogott: Works, yay! :)
[16:29:02] thanks
[16:29:19] hoo: mind if I reboot property-suggester for similar reasons?
[16:29:42] operations: Evaluate traffic flow between the Jobrunners and the Cirrus cluster - https://phabricator.wikimedia.org/T105705#1449729 (Joe) NEW
[16:29:53] No... could be taht property-suggester is unused even
[16:29:54] not sure
[16:29:59] property-suggester has a private puppet repo with no changes in it. Why do people do this?
[16:30:33] Mh...
I guess people want to stay flexible
[16:30:38] * hoo didn't set any of these up
[16:31:12] which, in this case ‘stay flexible’ means, ensure that instance rots beyond usability
[16:31:34] !log wikidata-dev updated local puppet and rebooting property-suggester
[16:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:33:27] (CR) Chad: "Then we'd have to template this, grrrrr :\" [puppet] - https://gerrit.wikimedia.org/r/224435 (owner: Chad)
[16:35:08] (CR) Yuvipanda: "You can also check domain and not do it for labs :)" [puppet] - https://gerrit.wikimedia.org/r/224435 (owner: Chad)
[16:35:29] manybubbles_: crap. I should have told you about the killing the old elasticsearch instance thing. I had to make that change when upgrading the logstash cluster. I ended up just stopping the service before the apt upgrade and then starting after.
[16:36:03] bd808: yup . I did that too eventually. did you see! running 2xelasticsearch pushed the load to ~40 on that node and made everything sad
[16:36:13] I'm not sure why that is
[16:36:38] yikes
[16:36:39] operations, Services, Mobile Content Service, service-deployment-requests: New Service Request mobileapps - https://phabricator.wikimedia.org/T105538#1449755 (Joe) a:Joe
[16:36:54] operations, Services, Mobile Content Service, service-deployment-requests: New Service Request mobileapps - https://phabricator.wikimedia.org/T105538#1446023 (Joe) I'll work on this given I'm oncall this week
[16:38:12] operations, Discovery: Request Elasticsearch hardware for secondary CirrusSearch in codfw - https://phabricator.wikimedia.org/T105707#1449758 (Manybubbles) NEW
[16:38:24] operations, CirrusSearch, Discovery, Discovery-Cirrus-Sprint: Request Elasticsearch hardware for secondary CirrusSearch in codfw - https://phabricator.wikimedia.org/T105707#1449758 (Manybubbles) a:EBernhardson
[16:40:47] (PS1) Yuvipanda: labstore: Remove NFS for project pdbhandler
[puppet] - https://gerrit.wikimedia.org/r/224439 (https://phabricator.wikimedia.org/T105704)
[16:41:11] (PS2) Yuvipanda: labstore: Remove NFS for project pdbhandler [puppet] - https://gerrit.wikimedia.org/r/224439 (https://phabricator.wikimedia.org/T105704)
[16:41:34] Coren: haha, you hit +2 but forgot to merge on https://gerrit.wikimedia.org/r/#/c/223830/1?
[16:41:41] (PS2) Yuvipanda: labstore: Excape grants *properly* [puppet] - https://gerrit.wikimedia.org/r/223830
[16:41:44] (PS3) Yuvipanda: es-tool: Restart ganglia after restarting Elasticsearch [puppet] - https://gerrit.wikimedia.org/r/224435 (owner: Chad)
[16:41:49] (CR) Yuvipanda: [C: 2] labstore: Excape grants *properly* [puppet] - https://gerrit.wikimedia.org/r/223830 (owner: Yuvipanda)
[16:42:03] (CR) Yuvipanda: [V: 2] labstore: Excape grants *properly* [puppet] - https://gerrit.wikimedia.org/r/223830 (owner: Yuvipanda)
[16:42:27] operations: Evaluate traffic flow between the Jobrunners and the Cirrus cluster - https://phabricator.wikimedia.org/T105705#1449776 (Gage) a:Gage
[16:43:02] (PS3) Yuvipanda: labstore: Remove NFS for project pdbhandler [puppet] - https://gerrit.wikimedia.org/r/224439 (https://phabricator.wikimedia.org/T105704)
[16:43:11] (CR) Yuvipanda: [C: 2] labstore: Remove NFS for project pdbhandler [puppet] - https://gerrit.wikimedia.org/r/224439 (https://phabricator.wikimedia.org/T105704) (owner: Yuvipanda)
[16:43:19] (CR) Yuvipanda: [V: 2] labstore: Remove NFS for project pdbhandler [puppet] - https://gerrit.wikimedia.org/r/224439 (https://phabricator.wikimedia.org/T105704) (owner: Yuvipanda)
[16:43:37] operations, CirrusSearch, Discovery: Decide on and document the implementation for multi-DC CirrusSearch - https://phabricator.wikimedia.org/T105708#1449782 (Manybubbles) NEW
[16:43:58] operations, CirrusSearch, Discovery: Decide on and document the implementation for multi-DC CirrusSearch -
https://phabricator.wikimedia.org/T105708#1449782 (Manybubbles) Looking for a README style document on how it works and how to set it up.
[16:44:28] operations, CirrusSearch, Discovery: Decide on and document the implementation for multi-DC CirrusSearch - https://phabricator.wikimedia.org/T105708#1449789 (Manybubbles) Like - that readme can totally be a "this is how we plan for it to work" at first but will be modified to be "this is how to do it"...
[16:44:29] (PS1) Yuvipanda: labstore: Remove *.exports files [puppet] - https://gerrit.wikimedia.org/r/224441
[16:44:38] operations: Evaluate traffic flow between the Jobrunners and the Cirrus cluster - https://phabricator.wikimedia.org/T105705#1449791 (Manybubbles)
[16:44:39] operations, CirrusSearch, Discovery: Decide on and document the implementation for multi-DC CirrusSearch - https://phabricator.wikimedia.org/T105708#1449790 (Manybubbles)
[16:44:46] (PS2) Yuvipanda: labstore: Remove *.exports files [puppet] - https://gerrit.wikimedia.org/r/224441
[16:44:54] (CR) Yuvipanda: [C: 2 V: 2] labstore: Remove *.exports files [puppet] - https://gerrit.wikimedia.org/r/224441 (owner: Yuvipanda)
[16:45:16] operations, CirrusSearch, Discovery: Implement multi-DC support in CirrusSearch - https://phabricator.wikimedia.org/T105709#1449792 (Manybubbles) NEW
[16:45:27] (PS1) Andrew Bogott: openstack: lint fixes [puppet] - https://gerrit.wikimedia.org/r/224442
[16:45:29] operations, CirrusSearch, Discovery: Decide on and document the implementation for multi-DC CirrusSearch - https://phabricator.wikimedia.org/T105708#1449782 (Manybubbles)
[16:45:30] operations, CirrusSearch, Discovery: Implement multi-DC support in CirrusSearch - https://phabricator.wikimedia.org/T105709#1449792 (Manybubbles)
[16:45:45] operations, Discovery: Cirrus search in codfw - https://phabricator.wikimedia.org/T105703#1449800 (Joe)
[16:47:19] (Abandoned) Andrew Bogott: openstack: lint
fixes [puppet] - https://gerrit.wikimedia.org/r/224442 (owner: Andrew Bogott)
[16:47:46] (PS4) Andrew Bogott: openstack: lint fixes [puppet] - https://gerrit.wikimedia.org/r/211356 (owner: Dzahn)
[16:48:00] (PS5) Andrew Bogott: openstack: lint fixes [puppet] - https://gerrit.wikimedia.org/r/211356 (owner: Dzahn)
[16:48:09] operations, Discovery: Rollout CirrusSearch to codfw as a backup DC - https://phabricator.wikimedia.org/T105711#1449826 (Manybubbles) NEW
[16:48:24] operations, Services, Mobile Content Service, service-deployment-requests: New Service Request mobileapps - https://phabricator.wikimedia.org/T105538#1449832 (Joe) p:Triage>Normal
[16:49:28] (CR) Andrew Bogott: [C: 2] openstack: lint fixes [puppet] - https://gerrit.wikimedia.org/r/211356 (owner: Dzahn)
[16:50:14] anyone on the echo job failures?
[16:50:37] ah - its just mysql connection failures.
[16:50:42] oh, that sounds bad, actually
[16:51:25] "just" :p
[16:52:14] mutante: we need yet another rebuild of adminbot - do you mind?
[16:55:15] I think he prepped everything for it on Friday evening
[16:56:32] https://gerrit.wikimedia.org/r/#/c/224221/ is merged so I think it just needs the deb build, stuffed into apt and deployed
[16:56:42] (CR) Andrew Bogott: [C: -1] "I'd prefer that they not have user creation, if it's possible to untangle this from other rights." [mediawiki-config] - https://gerrit.wikimedia.org/r/222776 (owner: Alex Monk)
[16:57:13] bd808: that includes the date-matching change?
[16:57:19] yes
[16:57:21] ok
[16:57:34] Lemme see if he’s in today… I’m sure he can rebuild/deploy 10x faster than I can
[16:58:05] yeah, looks like he’ll be a long any minute
[16:58:31] should kill that deb
[16:58:42] operations, Services, service-template-node, service-runner: Log levels not being respected on service-runner services on SCA - https://phabricator.wikimedia.org/T105500#1449866 (mobrovac) Open>Resolved [PR #41](https://github.com/wikimedia/service-runner/pull/41) fixes the issue of `service-r...
[16:59:25] (PS3) Alex Monk: wikitech: Clean up contentadmin rights [mediawiki-config] - https://gerrit.wikimedia.org/r/222776
[16:59:46] (CR) Andrew Bogott: [C: -1] "I'm -1'ing since Alex didn't" [mediawiki-config] - https://gerrit.wikimedia.org/r/221825 (owner: BryanDavis)
[17:00:34] (CR) Andrew Bogott: [C: 1] "Thanks!" [mediawiki-config] - https://gerrit.wikimedia.org/r/222776 (owner: Alex Monk)
[17:01:39] YuviPanda: do it!
[17:01:49] so much things to do, so little time :(
[17:02:43] file a ticket, wait a year and then someone will do it in July 2016 ;)
[17:05:38] operations, discovery-system, services-tooling: [RFC] Define the on-disk and live structure of etcd pool data - https://phabricator.wikimedia.org/T100793#1449878 (Joe) p:High>Low
[17:06:11] operations: Puppet catalog compiler is broken - https://phabricator.wikimedia.org/T96802#1449880 (Joe) p:High>Low
[17:07:04] operations, Discovery, Wikidata, Wikidata-Query-Service, and 2 others: Wikidata Query Service hardware - https://phabricator.wikimedia.org/T86561#1449884 (Joe)
[17:08:27] (PS4) Andrew Bogott: Direct labsconsole.wm.o through Apache cluster [dns] - https://gerrit.wikimedia.org/r/202791 (https://phabricator.wikimedia.org/T48554) (owner: Southparkfan)
[17:10:12] The related patch in operations-puppet needs to be rebased top
[17:10:14] Too*
[17:10:45] (CR) BryanDavis: [C: -2] "The proper fix for this will be a follow up to
Iba6f115a79dbc0060f64a9095467d147cf53b8ae" [mediawiki-config] - https://gerrit.wikimedia.org/r/221825 (owner: BryanDavis)
[17:10:47] (PS2) Giuseppe Lavagetto: service::node: auto-monitoring of local endpoints [WiP] [puppet] - https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821)
[17:10:49] <_joe_> JohnFLewis: sorry I'm stuck in a series of meetings :)
[17:11:19] <_joe_> and just finished working on the script-of-death
[17:11:33] (CR) jenkins-bot: [V: -1] service::node: auto-monitoring of local endpoints [WiP] [puppet] - https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) (owner: Giuseppe Lavagetto)
[17:12:00] _joe_: it's cool - I know you have one in 50 minutes anyway :)
[17:12:29] <_joe_> I have one right now :)
[17:13:13] I don't know your meeting calendar though unless you send it me ;)
[17:14:12] (Abandoned) Yuvipanda: labstore: Escape _s properly [puppet] - https://gerrit.wikimedia.org/r/223780 (owner: Yuvipanda)
[17:14:24] (PS8) Andrew Bogott: Direct labsconsole.wm.o through Apache cluster [puppet] - https://gerrit.wikimedia.org/r/202788 (https://phabricator.wikimedia.org/T48554) (owner: Southparkfan)
[17:14:56] andrewbogott: thank you for rebasing!
[17:15:23] SPF|Cloud: np, thanks for the patch
[17:15:47] Krenair or JohnFLewis, https://gerrit.wikimedia.org/r/#/c/202788/8 looks ok to me, would one of you like to sign off so I can merge?
[17:16:39] (CR) John F.
Lewis: [C: 1] "sign off" [puppet] - https://gerrit.wikimedia.org/r/202788 (https://phabricator.wikimedia.org/T48554) (owner: Southparkfan)
[17:17:17] (PS9) Alex Monk: Redirect labsconsole to wikitech [puppet] - https://gerrit.wikimedia.org/r/202788 (https://phabricator.wikimedia.org/T48554) (owner: Southparkfan)
[17:17:28] changed commit message
[17:17:31] PS9 :o
[17:17:49] DNS part sends it through apache
[17:18:04] this commit adds the apache rules so it knows what to do when it does get labsconsole requests
[17:18:12] Krenair: heh true with the title :)
[17:18:23] all looks good though
[17:18:28] (CR) Andrew Bogott: [C: 2] Redirect labsconsole to wikitech [puppet] - https://gerrit.wikimedia.org/r/202788 (https://phabricator.wikimedia.org/T48554) (owner: Southparkfan)
[17:19:04] I've annoyed JohnFLewis so many times because this patch was still not merged
[17:19:13] Actually these patches*
[17:19:36] (CR) Andrew Bogott: [C: 2] Direct labsconsole.wm.o through Apache cluster [dns] - https://gerrit.wikimedia.org/r/202791 (https://phabricator.wikimedia.org/T48554) (owner: Southparkfan)
[17:19:38] if only he annoyed someone who has +2 for them :)
[17:22:22] (PS3) Giuseppe Lavagetto: service::node: auto-monitoring of local endpoints [WiP] [puppet] - https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821)
[17:23:08] (CR) jenkins-bot: [V: -1] service::node: auto-monitoring of local endpoints [WiP] [puppet] - https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821) (owner: Giuseppe Lavagetto)
[17:24:04] (Abandoned) John F. Lewis: convert zirconium to private network [puppet] - https://gerrit.wikimedia.org/r/192827 (https://phabricator.wikimedia.org/T90676) (owner: John F. Lewis)
[17:24:11] (Abandoned) John F. Lewis: zirconium->wmnet dns [dns] - https://gerrit.wikimedia.org/r/192828 (https://phabricator.wikimedia.org/T90676) (owner: John F.
Lewis)
[17:25:16] operations, Patch-For-Review: remove public IP from zirconium - https://phabricator.wikimedia.org/T90676#1449951 (JohnLewis) Open>declined Setting this as declined (invalid probably better?) as with T105510 this becomes irrelevant.
[17:32:21] andrewbogott: i'll do it now
[17:32:28] mutante: thanks
[17:32:36] (PS4) Giuseppe Lavagetto: service::node: auto-monitoring of local endpoints [WiP] [puppet] - https://gerrit.wikimedia.org/r/223328 (https://phabricator.wikimedia.org/T94821)
[17:33:01] gwicke: cajoel also sails
[17:33:12] cajoel: we were discussing sailing this morning ;
[17:33:14] D
[17:35:40] JohnFLewis: hehe indeed :P
[17:35:42] yay, sailing! ;)
[17:36:37] !log included adminbot_1.7.11 in APT repo
[17:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:37:19] andrewbogott: that would be done, for precise
[17:37:45] mutante: want me to restart the bots?
[17:38:09] andrewbogott: yes, let's start with just one
[17:39:31] !log this is the first test log of three
[17:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:39:39] !log this is the second test log of three
[17:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:39:58] andrewbogott: and failed
[17:40:01] mutante: well… that looks no better to me
[17:40:10] oh, wait, probably I need to deploy the new versions :)
[17:40:38] eh, yes. please lookup which exec node that was on
[17:40:48] connecting to tools login
[17:41:33] I’m forcing some puppet runs, stay tuned...
[17:43:47] andrewbogott: just precise, would have done just that one exec node first
[17:44:22] mutante: except we can’t depend on the new bot appearing on the same exec node
[17:44:26] why just precise btw?
[17:44:45] andrewbogott: because that's what they run on
[17:44:54] and the trusty hosts that broke with dpkg issue
[17:45:00] probably shouldnt have it installed
[17:46:31] hmm, yes, that's true about not being able to predict where it will run.. hmm
[17:47:07] when restoring/deleting file on which ganglia graph i can see?
[17:47:42] mutante: what version am I aiming for here? Puppet isn’t upgrading.
[17:47:58] Steinsplitter: depends which host it is on
[17:48:01] andrewbogott: 1.7.11
[17:48:09] ok, I still get 1.7.10 on tools-exec-1204.
[17:48:19] mutante: ah, ok
[17:48:22] I have to go grab a bite before the meeting, though, we can revisit after
[17:48:42] ok
[17:49:53] it's just apt-get update, simulated upgrade sees it
[17:50:48] upgrades it to 1.7.11 on that exec node .. but i'll wait with restarting
[17:50:55] so we dont run into meeting time
[17:52:24] at least already confirmed no dpkg error on upgrade
[17:55:20] hm, I wonder if puppet is not doing that anymore?
[17:55:23] operations, Traffic, HTTPS: Drop AES-256 mid/compat lists.
- https://phabricator.wikimedia.org/T105716#1450034 (BBlack) NEW
[17:57:04] operations, ops-eqiad: db1050 raid degraded - https://phabricator.wikimedia.org/T103110#1450041 (Joe) p:Triage>High
[17:58:40] (PS1) BBlack: Drop AES256 from mid/compat lists [puppet] - https://gerrit.wikimedia.org/r/224445 (https://phabricator.wikimedia.org/T105716)
[17:59:14] operations, Traffic: Fix/decom multiple-subdomain wikis in wikimedia.org - https://phabricator.wikimedia.org/T102826#1450050 (BBlack) p:Triage>Normal
[18:05:54] (CR) Yuvipanda: Labs: Script to back labstore filesystems up (1 comment) [puppet] - https://gerrit.wikimedia.org/r/224064 (https://phabricator.wikimedia.org/T105027) (owner: coren)
[18:08:09] operations, Discovery, Wikidata, Wikidata-Query-Service, and 2 others: Wikidata Query Service hardware - https://phabricator.wikimedia.org/T86561#1450086 (Smalyshev)
[18:08:12] operations, Discovery, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Define the details of the hardware we need to run WDQS - https://phabricator.wikimedia.org/T104879#1450085 (Smalyshev) Open>Resolved
[18:09:40] (PS1) Chad: Default to eqiad, not pmtpa [tools/scap] - https://gerrit.wikimedia.org/r/224449
[18:21:04] (PS4) Alexandros Kosiaris: ferm rules for bacula [puppet] - https://gerrit.wikimedia.org/r/223849 (https://phabricator.wikimedia.org/T104996) (owner: Dzahn)
[18:21:13] (CR) Alexandros Kosiaris: [C: 2 V: 2] ferm rules for bacula [puppet] - https://gerrit.wikimedia.org/r/223849 (https://phabricator.wikimedia.org/T104996) (owner: Dzahn)
[18:24:07] operations: tin doesn't have access to same memcached as terbium and app servers - https://phabricator.wikimedia.org/T103198#1450166 (mmodell)
[18:25:18] operations, Analytics-Backlog, Performance-Team, Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1450169
(mmodell)
[18:28:00] akosiaris: :)
[18:29:47] !log es1.6 step 2: shut down extra instance of elasticsearch on elastic1021
[18:29:48] Puppet, Deployment-Systems, Release-Engineering: Puppet failure on deployment-sentry2 - https://phabricator.wikimedia.org/T78411#1450217 (thcipriani) Open>Resolved a:thcipriani Puppet on deployment-sentry2 seems to be running just fine now
[18:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:33:38] operations, ops-codfw, Swift: ms-be2013 - swift-storage/sdc1 is not accessible: Input/output error - https://phabricator.wikimedia.org/T105213#1450228 (Papaul) a:Papaul>fgiunchedi Disk replacement complete.
[18:34:50] (CR) Yuvipanda: "Should add a systemd unit as well :) Also I wonder if this should be done a lot more granularly:" [puppet] - https://gerrit.wikimedia.org/r/224064 (https://phabricator.wikimedia.org/T105027) (owner: coren)
[18:35:32] _joe_: did you have any time to look at https://gerrit.wikimedia.org/r/#/c/223580/ and https://gerrit.wikimedia.org/r/#/c/223663/ ?
[18:35:34] (CR) Alexandros Kosiaris: [C: -1] ferm rules for IRCd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/223886 (https://phabricator.wikimedia.org/T104943) (owner: Dzahn)
[18:35:38] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1450241 (Tgr) >>! In T102566#1449713, @BBlack wrote: > Have we released new software with https:// URLs? No. [[ https://ge...
[18:35:52] <_joe_> SMalyshev: not today, I plan to do that tomorrow
[18:36:06] _joe_: ok, cool
[18:36:31] _joe_: some of it is standard 'trebuchet-fication'. If I have time to look, would you object if I reviewed / merged? I'd hold off on anything I think might be controversial.
[18:36:36] <_joe_> SMalyshev: I'll merge the first one for sure, it's a no-brainer
[18:36:50] <_joe_> ori: go on, the first patch is trivial
[18:36:54] <_joe_> the second surely needs work
[18:37:09] (PS3) Ori.livneh: T95679: add WDQS deployment repo [puppet] - https://gerrit.wikimedia.org/r/223580 (owner: Smalyshev)
[18:37:11] nod
[18:37:12] <_joe_> ori: I'm actually happy when you do my work, it makes my rep sheet shinier :)
[18:37:17] (CR) Ori.livneh: [C: 2 V: 2] T95679: add WDQS deployment repo [puppet] - https://gerrit.wikimedia.org/r/223580 (owner: Smalyshev)
[18:37:52] oh yeah, https://gerrit.wikimedia.org/r/#/c/223663/ needs work
[18:37:55] i'll take a look
[18:38:08] that would be great
[18:38:28] that's my first production project so please tell me if things there are wrong
[18:38:29] <_joe_> ori: it needs some systemd unit files too, firewalls, etc :)
[18:38:45] I've added the systemd part actually
[18:38:50] not firewall though
[18:38:51] <_joe_> oh nice
[18:39:05] <_joe_> SMalyshev: used base::system_unit ?
[18:39:34] operations, Analytics-Backlog, Performance-Team, Release-Engineering, Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1450247 (Krinkle)
[18:39:40] _joe_: one of them. with updater I had a problem that I need to create the service but not run it. and looks like base::system_unit does not allow that
[18:39:55] <_joe_> SMalyshev: oh!, I should fix that then
[18:39:58] so I just set up the systemd file for that one
[18:40:02] <_joe_> SMalyshev: open a bug :)
[18:40:37] _joe_: ok.
Looks like you can say "enable the service" or "disable the service" but you can't say "set it up but leave it alone in the state it is already in"
[18:40:39] operations, Analytics-Backlog, Performance-Team, Release-Engineering, Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1379593 (Krinkle) Adding Release-Engineering since Deployment-Systems was removed. This is related to t...
[18:41:33] <_joe_> SMalyshev: /win 18
[18:41:36] <_joe_> ouch
[18:41:37] and updater is tricky because it needs to be run only after the dump is loaded. so when I initially deploy it should be down, but if I update and it's already manually started, then it should be up. not sure how to explain that to puppet
[18:42:06] <_joe_> uhm I'm not sure I get what you want to do
[18:42:12] <_joe_> but I'm in a meeting sorry :(
[18:42:27] _joe_: ok :) I'll create a phab task
[18:42:49] <_joe_> SMalyshev: I'll work on it tomorrow then, as well on the hardware request
[18:42:49] operations, ops-eqiad, Traffic: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1450269 (BBlack) p:Triage>Normal
[18:42:58] _joe_: cool, thanks
[18:43:00] <_joe_> we might be able to get a couple of spares
[18:43:08] operations, Traffic, HTTPS, Mobile, Patch-For-Review: TLS and *.wap/*.mobile multi-level subdomains of wikipedia.org - https://phabricator.wikimedia.org/T104942#1450270 (BBlack) p:Triage>Normal
[18:43:43] operations, Traffic, fundraising-tech-ops, Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#1450272 (BBlack) p:Triage>Normal
[18:43:52] operations, Traffic, HTTPS, Patch-For-Review: Decom old multiple-subdomain wikis in wikipedia.org - https://phabricator.wikimedia.org/T102814#1450273 (BBlack) p:Triage>Normal
[18:44:01] operations, Traffic, HTTPS, Patch-For-Review: Drop AES-256
mid/compat lists. - https://phabricator.wikimedia.org/T105716#1450276 (10BBlack) p:5Triage>3Normal [18:47:50] 6operations, 10Deployment-Systems, 5Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1450295 (10mmodell) [18:48:00] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1450297 (10demon) >>! In T102566#1449713, @BBlack wrote: > So, where are we at on removing the redirection workarounds here?... [18:52:19] !log running populateContentModel.php --table=page on testwiki [18:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:57:26] (03CR) 10Smalyshev: [C: 04-1] "Need to add journalctl as discussed in the task." [puppet] - 10https://gerrit.wikimedia.org/r/223984 (https://phabricator.wikimedia.org/T105185) (owner: 10Dzahn) [18:58:31] 6operations, 7Graphite: Upgrade Graphite from 0.9.12 to 0.9.13 - https://phabricator.wikimedia.org/T104536#1450310 (10Krinkle) Thanks! [19:00:44] <_joe_> mutante: I'll take a look at those permissions tomorrow, thanks for preparing the patch [19:01:05] _joe_: alright, cool [19:01:06] * _joe_ away [19:01:42] !log one of two [19:01:42] !log morebots - are you 1.7.11 ? [19:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:01:49] !log two of two [19:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:01:59] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default, 5Patch-For-Review: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#1450325 (10BBlack) This is done for the primary unified cert on the cache clusters now. At some later point we may want to go ECDSA+RSA for some of our min... 
[19:01:59] good :) [19:02:02] working! [19:02:06] I’ll restart the others. [19:02:07] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1450328 (10BBlack) [19:02:08] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default, 5Patch-For-Review: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#1450326 (10BBlack) 5Open>3Resolved a:3BBlack [19:02:09] andrewbogott: yes, nice!:0 [19:03:59] (03CR) 10Alexandros Kosiaris: "I missed that one. Indeed. It must not be on the module, it must be on the role. Thanks Daniel!" [puppet] - 10https://gerrit.wikimedia.org/r/223886 (https://phabricator.wikimedia.org/T104943) (owner: 10Dzahn) [19:06:24] 6operations, 7Graphite, 7Monitoring: deprecate gdash - https://phabricator.wikimedia.org/T104365#1450348 (10Krinkle) > gdash is vital for operations the replacement should allow at least for: > > * easily look back timespans (-3h -1d, etc) > * easy to share dashboards (i.e. via links) for other people to lo... 
[19:07:16] (03PS5) 10Dzahn: ferm rules for IRCd [puppet] - 10https://gerrit.wikimedia.org/r/223886 (https://phabricator.wikimedia.org/T104943) [19:08:50] !log running populateContentModel.php --table=page on all small wikis [19:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:15:16] (03CR) 10Dzahn: "@akosaris, done" [puppet] - 10https://gerrit.wikimedia.org/r/223886 (https://phabricator.wikimedia.org/T104943) (owner: 10Dzahn) [19:17:27] (03CR) 10BryanDavis: [C: 032] Default to eqiad, not pmtpa [tools/scap] - 10https://gerrit.wikimedia.org/r/224449 (owner: 10Chad) [19:17:50] (03Merged) 10jenkins-bot: Default to eqiad, not pmtpa [tools/scap] - 10https://gerrit.wikimedia.org/r/224449 (owner: 10Chad) [19:19:24] (03CR) 10Dzahn: "can we have datacenter2: codfw ?:)" [tools/scap] - 10https://gerrit.wikimedia.org/r/224449 (owner: 10Chad) [19:19:28] (03PS1) 10Yuvipanda: labs: Don't wait for NFS mountpoints every puppet run [puppet] - 10https://gerrit.wikimedia.org/r/224461 [19:19:34] andrewbogott: ^ [19:20:16] (03PS2) 10Andrew Bogott: labs: Don't wait for NFS mountpoints every puppet run [puppet] - 10https://gerrit.wikimedia.org/r/224461 (owner: 10Yuvipanda) [19:20:19] mutante: I'm working on a scap patch for codfw support -- https://gerrit.wikimedia.org/r/#/c/224313/ [19:22:40] (03PS3) 10BBlack: move majority of privates/files usage to secret() [puppet] - 10https://gerrit.wikimedia.org/r/224213 [19:22:42] bd808: cool :) [19:22:58] 6operations, 7Graphite, 7Monitoring: deprecate gdash - https://phabricator.wikimedia.org/T104365#1450383 (10fgiunchedi) >>! In T104365#1450348, @Krinkle wrote: >> gdash is vital for operations the replacement should allow at least for: >> >> * easily look back timespans (-3h -1d, etc) >> * easy to share das... 
[19:23:48] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL 20.00% of data above the critical threshold [100000000.0] [19:24:58] !log es1.6 step 3: upgrade elastic1002 [19:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:26:28] bd808: thanks for cleaning SAL up after adminlogbot :) [19:26:47] somebody needed to do it :) [19:27:17] i was looking for my phone [19:27:21] to get 2-factor auth... [19:27:29] to do the log cleanup:) [19:29:28] heh. I did a lot of wikignome work over the weekend [19:29:39] I can see why it scratches an itch for some folks [19:29:53] TIL: wikignome [19:30:05] I used to crop frames off images in commons [19:30:08] it was a fun activity [19:30:41] (03PS1) 10Andrew Bogott: Use the same hiera var ('puppetmaster') on labs and prod. [puppet] - 10https://gerrit.wikimedia.org/r/224465 [19:30:46] i used to fix spelling in category:vulgarities on wikt [19:32:43] godog: There is a whole taxonomy of wiki user types -- https://en.wikipedia.org/wiki/Wikipedia:WikiFauna [19:32:49] bblack: got another puppet error though the file exists in the secrets module locally [19:32:51] Error: /Stage[main]/Exim::Roled::Mail_relay/Exim4::Dkim[wikimedia.org]/File[/etc/exim4/dkim/wikimedia.org-wikimedia.key]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///private/dkim/wikimedia.org-wikimedia.key [19:34:26] JohnFLewis: the definitions for exim4 haven't been switched to secret() yet (but should eventually, https://gerrit.wikimedia.org/r/224213 blocked on labs-private + self-hosted) [19:34:58] bd808: haha nice, I had no idea [19:35:07] bblack: ah, so if I move the local private file into the old file/ path, should work? [19:35:20] JohnFLewis: so probably that file doesn't exist in files/ssl/ for labs yet, best to create in both. [19:35:36] so that it will keep working when it switches, later. 
[19:36:30] (I was expecting the transition period to be fairly short, but it may drag out a few days!) [19:37:25] (03CR) 10Yuvipanda: [C: 032] labs: Don't wait for NFS mountpoints every puppet run [puppet] - 10https://gerrit.wikimedia.org/r/224461 (owner: 10Yuvipanda) [19:42:46] !log Updated Wikidata's property suggester with data from today's json dump [19:42:48] (03PS1) 10Yuvipanda: puppetmaster: Make repo path for gitsync nonconfigurable [puppet] - 10https://gerrit.wikimedia.org/r/224493 [19:42:50] (03CR) 10BBlack: [C: 04-1] "Should wait on T92756 so as not to break random things on labs self-hosted puppetmasters." [puppet] - 10https://gerrit.wikimedia.org/r/224213 (owner: 10BBlack) [19:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:42:57] cant link to en.wikipedia from phab macros: [19:43:00] "The domain "en.wikipedia.org" resolves to the address "208.80.154.224", which is blacklisted for outbound requests." [19:43:14] i guess feature, should use commons links [19:45:56] (03PS1) 10Andrew Bogott: Update labs-private as well as production puppet repo. [puppet] - 10https://gerrit.wikimedia.org/r/224495 (https://phabricator.wikimedia.org/T92756) [19:46:17] 7Puppet, 6Labs: puppetmaster::gitsync should update labs/private repository as well - https://phabricator.wikimedia.org/T92756#1450444 (10Andrew) Attached patch can be much prettier if/when the script is rewritten in a proper language. [19:47:37] andrewbogott: whoops, conflict! I'm redoing the script slightly so it isn't just copy paste [19:47:45] andrewbogott: https://gerrit.wikimedia.org/r/#/c/224493/ and upcoming commits [19:47:47] ok, as you like [19:48:01] but why did you ping me about it if you were doing it? :) [19:50:08] RECOVERY - Outgoing network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0] [19:50:39] andrewbogott: I pinged you about it *and* then started doing it... 
[19:50:47] sorry [19:51:18] :) [19:52:00] (03PS2) 10Andrew Bogott: Use the same hiera var ('puppetmaster') on labs and prod. [puppet] - 10https://gerrit.wikimedia.org/r/224465 [19:53:33] labs instances have no IPv6, right [19:53:50] what is the usual workaround for that one, when testing puppet roles that set the mapped v6 addresses [19:54:33] (03PS2) 10Yuvipanda: puppetmaster: Cleanup of git-sync-upstream [puppet] - 10https://gerrit.wikimedia.org/r/224493 [19:54:38] bd808: bblack can you check my bash? ^ [19:54:47] * YuviPanda is terrible at bash, which is also one reason he rewrites things to python :P [19:57:27] YuviPanda: $(...) is easier to spot than `...`; you forgot to update the labs/private repo [19:57:42] bd808: yes, doing that in a separate commit [19:57:44] this one should be a no-op [19:58:26] bd808: is ${1} also a thing/ [19:58:26] ? [19:58:29] or should I just do $1 [19:59:16] ${1} is valid. I like to use ${...} all the time to make vars stand out [19:59:50] * bd808 has idiosyncratic style [20:00:04] gwicke cscott arlolra subbu: Respected human, time to deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150713T2000). Please do the needful. [20:00:18] 6operations, 7Graphite, 7Varnish: Varnish caches Grafana dashboard configuration too strongly - https://phabricator.wikimedia.org/T105734#1450492 (10Krinkle) 3NEW [20:00:19] no parsoid deploy this week. [20:00:27] 6operations, 10Wikimedia-Git-or-Gerrit, 5Patch-For-Review: TransparencyReport repository master in Gerrit silently made private - https://phabricator.wikimedia.org/T89640#1450499 (10akosiaris) Hello, Just force pushing to the public repo instead of the private should be sufficient. I am not sure how you hav... [20:01:54] YuviPanda: I haven't run it, but it looks ok. 
You can test pretty easily with a cherry-pick to deployment-salt [20:02:09] (03PS3) 10Yuvipanda: puppetmaster: Cleanup of git-sync-upstream [puppet] - 10https://gerrit.wikimedia.org/r/224493 [20:02:22] bd808: your style sounds sane to me so I have adopted it :) [20:02:38] And I see most of the style stuff I'm moaning about are changes you snuck in previously when I wasn't looking [20:02:39] (03PS4) 10Yuvipanda: puppetmaster: Cleanup of git-sync-upstream [puppet] - 10https://gerrit.wikimedia.org/r/224493 [20:02:50] (03PS6) 10Andrew Bogott: Split labs-specific bits of base into labs::base [puppet] - 10https://gerrit.wikimedia.org/r/33066 (owner: 10Faidon Liambotis) [20:03:20] 6operations, 7Graphite, 7Varnish: Varnish caches Grafana dashboard configuration too strongly - https://phabricator.wikimedia.org/T105734#1450514 (10Krinkle) The cache being fragmented between HTTP and HTTPS also adds to the confusion. Let's enforce HTTPS for this domain? [20:04:00] bd808: yes, I also made it do hard reset and killed your autostash code, which I should probably bring back [20:04:07] * YuviPanda is slow to recognize bd808's wisdom [20:04:34] yeah.. I have a cherry-pick somewhere in a labs project that puts back the nice behavior [20:04:55] the "this should never happen" reason is not very user friendly [20:05:38] You killed that when you were on a fix-all-the-hacks sprint and I think you just saw it as yet another hack [20:05:48] yeah [20:05:56] it would be easier to review if the un-template and change were separate :) [20:06:19] bd808: yeah, I was doing that and got really lazy :| sorry! 
[20:06:20] err [20:06:21] bblack: ^ [20:06:37] bd808: it currently doesnt' reset --hard though, it just fails [20:06:52] :) [20:06:57] reset --hard is evil [20:08:13] I do have an evil local workflow that involves reset --hard all the time [20:08:15] and no branching [20:08:18] I bet it'll bite me some day [20:09:37] bd808: I'll try put back the autostash this week [20:10:01] bd808: and you were right, I think I was in a 'eeeughh, so many convenience hacks, get rid of them all!!!' spree then [20:10:43] (03CR) 10BBlack: [C: 031] puppetmaster: Cleanup of git-sync-upstream [puppet] - 10https://gerrit.wikimedia.org/r/224493 (owner: 10Yuvipanda) [20:10:56] (03CR) 10Yuvipanda: [C: 032 V: 032] puppetmaster: Cleanup of git-sync-upstream [puppet] - 10https://gerrit.wikimedia.org/r/224493 (owner: 10Yuvipanda) [20:12:04] I really hope our post-gerrit system (arc?) makes life easier for reviewing one complete reviewed-thing containing many smaller commits for refactoring [20:12:13] (03PS1) 10Yuvipanda: puppetmaster: Update labs/private too [puppet] - 10https://gerrit.wikimedia.org/r/224499 (https://phabricator.wikimedia.org/T92756) [20:12:18] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: Update labs/private too [puppet] - 10https://gerrit.wikimedia.org/r/224499 (https://phabricator.wikimedia.org/T92756) (owner: 10Yuvipanda) [20:12:22] even when not doing heavy refactoring, it's nice to split file-renames and whitespace-only diffs as separate commits from functional bits [20:12:32] (03PS2) 10Yuvipanda: puppetmaster: Update labs/private too [puppet] - 10https://gerrit.wikimedia.org/r/224499 (https://phabricator.wikimedia.org/T92756) [20:12:34] yeah [20:12:38] (03CR) 10Andrew Bogott: "I just purged all refs to "puppetClass: base" in ldap, so it's now safe to have everything derive from role::labs::instance" [puppet] - 10https://gerrit.wikimedia.org/r/33066 (owner: 10Faidon Liambotis) [20:14:08] (03PS7) 10Andrew Bogott: Split labs-specific bits of base into labs::base [puppet] - 
10https://gerrit.wikimedia.org/r/33066 (owner: 10Faidon Liambotis) [20:14:28] andrewbogott: https://gerrit.wikimedia.org/r/#/c/224499/ for the labs/private bit [20:14:53] (03CR) 10Dzahn: [C: 031] "certainly looks good, but who wants to restart ircd and the RC bot?:)" [puppet] - 10https://gerrit.wikimedia.org/r/224242 (https://phabricator.wikimedia.org/T87780) (owner: 10Glaisher) [20:15:03] 7Puppet, 6Labs, 3Labs-Sprint-106: puppetmaster::gitsync should update labs/private repository as well - https://phabricator.wikimedia.org/T92756#1450537 (10yuvipanda) [20:15:27] YuviPanda: I don’t think that will work; did you try it? [20:15:34] andrewbogott: yes. [20:16:00] andrewbogott: previous patch contained meat of changes tho [20:16:06] Oh, is GIT_SSH=/var/lib/git/ssh in the previous patch? [20:16:09] andrewbogott: https://gerrit.wikimedia.org/r/#/c/224493/ [20:16:11] yes [20:16:14] ok then :) [20:16:15] I'm using it for everything now [20:16:28] this was a bad split and I should feel bad, etc [20:16:34] (03CR) 10Yuvipanda: [C: 032] puppetmaster: Update labs/private too [puppet] - 10https://gerrit.wikimedia.org/r/224499 (https://phabricator.wikimedia.org/T92756) (owner: 10Yuvipanda) [20:17:38] (03CR) 10Alexandros Kosiaris: [C: 031] Optionally disable connection tracking per service [puppet] - 10https://gerrit.wikimedia.org/r/223751 (owner: 10Muehlenhoff) [20:18:21] (03Abandoned) 10Andrew Bogott: Update labs-private as well as production puppet repo. 
[puppet] - 10https://gerrit.wikimedia.org/r/224495 (https://phabricator.wikimedia.org/T92756) (owner: 10Andrew Bogott) [20:19:00] (03CR) 10Andrew Bogott: [C: 04-1] "Marking as -1 because Tim didn't and should have :)" [puppet] - 10https://gerrit.wikimedia.org/r/218117 (https://phabricator.wikimedia.org/T102367) (owner: 10Yuvipanda) [20:23:07] (03PS2) 10Yuvipanda: dynamicproxy: Allow proxies to be https only [puppet] - 10https://gerrit.wikimedia.org/r/218117 (https://phabricator.wikimedia.org/T102367) [20:23:15] (03PS1) 10Ori.livneh: Temporary hack to facilitate migration of l10n cache implementations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224501 [20:25:17] (03CR) 10Ori.livneh: [C: 032] Temporary hack to facilitate migration of l10n cache implementations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224501 (owner: 10Ori.livneh) [20:25:23] (03Merged) 10jenkins-bot: Temporary hack to facilitate migration of l10n cache implementations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224501 (owner: 10Ori.livneh) [20:26:13] (03PS1) 10Ori.livneh: Update my (=ori's) dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/224503 [20:26:26] (03CR) 10Ori.livneh: [C: 032 V: 032] Update my (=ori's) dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/224503 (owner: 10Ori.livneh) [20:30:45] !log ori Synchronized wmf-config/CommonSettings.php: Ieb62ee05: Temporary hack to facilitate migration of l10n cache implementations (duration: 00m 11s) [20:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:33:27] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.67% of data above the critical threshold [1000.0] [20:34:12] (03PS1) 10Yuvipanda: beta: Switch puppet cherry-pick check to new graphite metric name [puppet] - 10https://gerrit.wikimedia.org/r/224505 [20:36:12] bd808: ^ [20:36:16] not sure if that's of much use anymore tho [20:36:21] (03CR) 10Andrew Bogott: "This needs a manual rebase." 
[puppet] - 10https://gerrit.wikimedia.org/r/201880 (owner: 10Dzahn) [20:37:01] YuviPanda: I guess it depends if any opsen are watching that and actually doing reviews :) [20:37:12] heh, no is the answer I guess [20:37:24] I guess I do try [20:37:27] maybe I can clear some out [20:37:49] bd808: hmm, I'm not a reviewr on any of them [20:38:55] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Drop AES-256 mid/compat lists. - https://phabricator.wikimedia.org/T105716#1450565 (10BBlack) [20:42:03] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Drop AES-256 mid/compat lists. - https://phabricator.wikimedia.org/T105716#1450568 (10BBlack) To go a bit further on what's questionable about this: It's questionable whether we should even be trying to do enforcement against bad choices like this. Ignori... [20:42:11] !log ori Synchronized php-1.26wmf13/includes/api/ApiMain.php: f9c89d2814: Revert "Revert Count API module instantiations and Hook runs" (duration: 00m 13s) [20:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:44:59] ori: can we build the LCStoreStaticArray files from the json we ship for cdbs? [20:45:14] If so we could temporarily build both [20:45:39] godog: bd808: https://grafana.wikimedia.org/#/dashboard/db/varnish-http-errors [20:45:56] there was a big dropoff in pageviews a couple of hours ago, is that a known impact from a code change?? [20:46:35] https://gdash.wikimedia.org/dashboards/reqsum/ [20:46:46] (03PS1) 1020after4: remove include ::diamond [puppet] - 10https://gerrit.wikimedia.org/r/224507 [20:47:18] http req/sec didn't drop, so I assume it's not really lost traffic, just something that used to be a "pageview" isn't anymore? [20:47:56] the only thing I see in SAL around that time is manybubbles working on the elasticsearch cluster [20:48:05] I didn't do it [20:48:07] what? [20:48:09] mutante: https://dpaste.de/LWZK [20:48:21] mutante: I got that with reprepro, do you know how to fix that? 
[20:48:24] or what is wrong? [20:48:32] manybubbles: https://gdash.wikimedia.org/dashboards/reqsum/ -- pageviews fell off a cliff [20:48:39] the gdash link earlier: ignore the first graph, look at the next two graphs, which are 8hr views of "pageviews/sec" and "http reqs/sec" [20:49:19] i SEE IT [20:49:26] CAPS LOCK [20:49:34] but the requests/sec didn't [20:49:38] yeah [20:49:49] if the reqs/sec did too, I'd be thinking LVS or TLS problems or something heh [20:49:50] I think I can safely say I didn't do it. for once [20:49:56] yeah [20:50:28] mutante: nvm, running export fixed it [20:50:53] (03CR) 10Manybubbles: "My bash script isn't good enough for you?!" [puppet] - 10https://gerrit.wikimedia.org/r/223974 (owner: 10EBernhardson) [20:51:11] YuviPanda: include the .changes files, not the .deb [20:51:16] oh also, my cipher stats died about 10 minutes ago heh: https://tessera.wikimedia.org/dashboards/6/ciphers?from=-1h [20:51:30] maybe graphite/logstash/statsd issues again there [20:51:39] YuviPanda: if you have one, that is [20:51:46] there isn't one unfortunately [20:51:53] hmm or maybe there is [20:51:53] let me see [20:52:30] 861x icinga "UNKNOWN"s right now re: hhvm, but probably not hhvm, just monitoring/stats infrastructure issues [20:52:43] "UNKNOWN: More than half of the datapoints are undefined " <- that sort of thing [20:53:01] YuviPanda: the whole "gpg is confused" stuff is an env issue. do "sudo su -" to become root [20:53:04] (03CR) 10Manybubbles: "Its cool. Whatever works. I'm worried about making something super opaque because it'll be hard to react to changes. Like how we really sh" [puppet] - 10https://gerrit.wikimedia.org/r/223974 (owner: 10EBernhardson) [20:53:09] bblack: ori turned the stats back on that we turned off on Friday [20:53:25] ah! 
[20:53:28] (03PS1) 10Legoktm: Set $wgCentralAuthStrict = true; [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224510 (https://phabricator.wikimedia.org/T37707) [20:53:34] [20:42] !log ori Synchronized php-1.26wmf13/includes/api/ApiMain.php: f9c89d2814: Revert "Revert Count API module instantiations and Hook runs" (duration: 00m 13s) [20:53:37] (03PS1) 10Thcipriani: Fix /static 404s in beta [puppet] - 10https://gerrit.wikimedia.org/r/224511 (https://phabricator.wikimedia.org/T105541) [20:53:41] yeah that makes sense [20:53:56] ori: did you backport the core change too? [20:53:57] 6operations, 10Labs-Vagrant: Backport Vagrant 1.7+ from Debian experimental to our Trusty apt repo - https://phabricator.wikimedia.org/T93153#1450603 (10yuvipanda) So the sid packages are a bit of a mess to backport to trusty, but I have now put in the official vagrant packages on Carbon. They are kind of shit... [20:53:59] 10Ops-Access-Requests, 6operations, 6Services, 7Icinga, 7Monitoring: give services team permissions to send commands in icinga - https://phabricator.wikimedia.org/T105228#1450602 (10RobH) This was discussed in the operations meeting. The overall concensus seemed to be that full access for all icinga ale... [20:54:02] bd808: ^^ [20:54:04] ori: Revert Revert Count API module ... seems to be causing issues [20:54:45] 10Ops-Access-Requests, 6operations, 7Icinga: give John Lewis permissions to send commands in icinga - https://phabricator.wikimedia.org/T105229#1450604 (10RobH) This was discussed in the operations meeting. The overall concensus seemed to be that full access for all icinga alerting and commands for all serv... [20:58:32] 6operations, 10Beta-Cluster, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1450629 (10Dzahn) Can't they go to jessie right away? I guess they can't because i hear we don't build HHVM for jessie. 
[20:59:11] 6operations, 7Tracking: Upgrade Wikimedia servers to Ubuntu Trusty (14.04) (tracking) - https://phabricator.wikimedia.org/T65899#1450634 (10Dzahn) How much of this can be "to Debian jessie" right away without going through an additional step? [21:00:28] Graphite stats for EventLogging and ResourceLoader also dropped by 90% from 1Mil to < 100K. [21:02:38] bblack: what is the issue? [21:04:15] ori: other statsd stats died like last time: (e.g. https://tessera.wikimedia.org/dashboards/6/ciphers?from=-1h ) + ~800x warnings about missing stats datapoints on other things: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=8&hoststatustypes=3&serviceprops=2097162&nostatusheader [21:04:20] I assume all related [21:04:34] 6operations, 10Labs-Vagrant: Backport Vagrant 1.7+ from Debian experimental to our Trusty apt repo - https://phabricator.wikimedia.org/T93153#1450641 (10yuvipanda) It's an FPM built package and upstream doesn't care much https://github.com/mitchellh/vagrant-installers/issues/12 so I'm ok with this being used a... [21:06:55] oh yeah, different bug this time. [21:08:18] !log ori Synchronized php-1.26wmf13/includes/api/ApiMain.php: (no message) (duration: 00m 13s) [21:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:08:58] !log ori Synchronized php-1.26wmf13/includes/Hooks.php: (no message) (duration: 00m 12s) [21:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:22:38] 6operations, 7Graphite, 7Monitoring: deprecate gdash - https://phabricator.wikimedia.org/T104365#1450686 (10Krinkle) As example I've done reqerror, and part of graphite-eqiad: * https://grafana.wikimedia.org/#/dashboard/db/varnish-http-errors * https://grafana.wikimedia.org/#/dashboard/db/graphite-eqiad See... [21:25:28] andrewbogott: around for a quick question about labs? [21:25:37] what’s up? 
[21:25:39] (to make sure I'm not going insane) [21:26:27] andrewbogott: firstly, idk why I have a proxy on mailman-jessie.wmflabs.org when we have a public IP but: if apache on port 80 keeps redirecting that domain to https, would it theoretically talk to apache again on port 80 creating a redirect loop? [21:26:31] mutante: ^ [21:27:13] or am I trying to blame labs for a completely irrelevant issue/being lazy and using the proxy badly creating an issue? :) [21:27:15] https isn’t port 80, so I wouldn’t think that would happen [21:27:35] If you’re using a web proxy then you should only ever get port 80 traffic on your instance. [21:27:44] https is handled by the proxy and then relayed on 80. [21:27:53] right, that's the issue then [21:27:57] If your instance is redirecting from 80 then that will definitely not work [21:28:08] * JohnFLewis needs to kill the proxy [21:28:18] your code needs to recognize X-FORWARDED-PROTO [21:28:25] and use that to determine protocol [21:28:27] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL 18.75% of data above the critical threshold [100000000.0] [21:28:29] and not which port it's listening on [21:28:38] JohnFLewis: I'm not sure about the labs-specific bits here, but usually when we have an HTTPS proxy in front of an apache, with apache doing the redirect, we make the redirect conditional on !X-Forwarded-Proto: https [21:28:51] err, what Yuvi said faster heh [21:29:30] but that doesn't apply to mailman [21:29:30] bblack: YuviPanda|zzz: honestly the proxy needs to die here anyway, it's legacy from pre-public IP days and if we have it, it's a greater difference between production and labs for mailman :) [21:29:37] because mailman is not behind misc-web [21:29:44] yeah [21:30:03] if you're not going to have a proxy (even a local one), then have apache listen on ports 443 and 80, and only redirect from 80. 
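Editorial sketch of the redirect rule discussed above (a proxy terminates HTTPS and relays to the backend on port 80, so the backend should only redirect when X-Forwarded-Proto is not https). This is a minimal illustration in Python, not the Apache configuration used on the instance; the function name and its arguments are hypothetical.

```python
from typing import Mapping, Optional


def https_redirect_target(headers: Mapping[str, str], host: str, path: str) -> Optional[str]:
    """Return the HTTPS URL to redirect to, or None if no redirect is needed.

    Hypothetical helper: decides from the X-Forwarded-Proto header, not from
    the port the backend happens to be listening on.
    """
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    # Assumption: a missing header means the client spoke plain HTTP.
    proto = normalised.get("x-forwarded-proto", "http").strip().lower()
    if proto == "https":
        # The proxy already served this request over HTTPS; redirecting
        # again here is what produces the loop.
        return None
    return f"https://{host}{path}"
```

With a check like this, a request relayed by the proxy with `X-Forwarded-Proto: https` is served directly, while a plain-HTTP request still gets sent to the HTTPS URL.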
[21:30:27] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [21:31:46] bblack: which is what the case should be and which I've done now but I'm getting an nginx 404 [21:32:32] kudos for not going the "role mailman::labs" way but doing it right [21:32:33] !log renaming ~3k users who were originally missed for SULF [21:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:33:10] (03PS1) 10Jgreen: add frack subnets to network.pp, add frack-codfw to icinga firewall policy [puppet] - 10https://gerrit.wikimedia.org/r/224519 [21:34:35] (03CR) 10Alex Monk: add frack subnets to network.pp, add frack-codfw to icinga firewall policy (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/224519 (owner: 10Jgreen) [21:36:07] JohnFLewis: [21:36:09] mailman-jessie.wmflabs.org has address 208.80.155.206 [21:36:09] mailman-jessie.wmflabs.org has address 208.80.155.156 [21:36:15] one is the proxy and one is direct? [21:36:23] first one [21:36:27] second is proxy [21:36:28] somehow it has both [21:36:37] the deletion of the proxy didnt quite work yet [21:36:39] it’s probably cached still from when the proxy existed [21:36:51] right [21:38:57] (03PS2) 10Jgreen: add frack subnets to network.pp, add frack-codfw to icinga firewall policy [puppet] - 10https://gerrit.wikimedia.org/r/224519 [21:40:49] oooh, now that one will be fun [21:40:51] "You cannot visit mailman-jessie.wmflabs.org right now because the website uses HSTS." [21:41:30] oh, right, does it?:) but we dont want to use the proxy anymore :p [21:41:34] (03CR) 10Andrew Bogott: [C: 031] dynamicproxy: Allow proxies to be https only [puppet] - 10https://gerrit.wikimedia.org/r/218117 (https://phabricator.wikimedia.org/T102367) (owner: 10Yuvipanda) [21:41:42] (03CR) 10Matanya: [C: 04-1] "what alwx said + you can't give private and public subnets the same name." 
[puppet] - 10https://gerrit.wikimedia.org/r/224519 (owner: 10Jgreen) [21:42:05] I think for testing you’ll have to check in a self-signed cert into labs/private and then click through the exception [21:42:10] 6operations, 7Graphite: Upgrade to Grafana v2.x - https://phabricator.wikimedia.org/T104738#1450741 (10Krinkle) [21:42:19] andrewbogott: that's exactly what we did [21:42:28] andrewbogott: but before it was on the instance/domainproxy [21:42:54] (03PS1) 10Ori.livneh: Don't assume current l10n cache files are .cdb [tools/scap] - 10https://gerrit.wikimedia.org/r/224520 [21:43:06] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 (Hadoop / HDFS / Hue) for for tbayer - https://phabricator.wikimedia.org/T105748#1450742 (10dr0ptp4kt) 3NEW [21:43:37] andrewbogott: in the project puppet master we have a self-sign cert which is that [21:43:42] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 (Hadoop / HDFS / Hue) for for tbayer - https://phabricator.wikimedia.org/T105748#1450754 (10dr0ptp4kt) [21:43:55] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 (Hadoop / HDFS / Hue) for tbayer - https://phabricator.wikimedia.org/T105748#1450755 (10dr0ptp4kt) [21:43:56] unless I did that wrong? [21:44:07] Having it local is fine — either way :) [21:44:19] 10Ops-Access-Requests, 6operations, 6Reading-Admin: Requesting access to stat1002 (Hadoop / HDFS / Hue) for tbayer - https://phabricator.wikimedia.org/T105748#1450742 (10dr0ptp4kt) [21:45:12] andrewbogott: though it is set up with a self sign cert for mailman-jessie.wmflabs.org, unless it is supposed to be different? [21:45:38] *shrug* should be for whatever hostname you’re using [21:45:49] * JohnFLewis looks into it more [21:45:54] (03PS3) 10Jgreen: add frack subnets to network.pp, add frack-codfw to icinga firewall policy [puppet] - 10https://gerrit.wikimedia.org/r/224519 [21:46:06] that host redirects me to a 403 at /mailman/listinfo ? 
[21:47:28] RECOVERY - Outgoing network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0] [21:49:35] mutante: what do you get? [21:49:54] I get the HSTS warning I can't click through yet Krenair gets something. hm? [21:50:10] mutante / JohnFLewis: just catching up on the above - I would highly advise disabling whatever's setting HSTS while you're experimenting. [21:50:26] the domainproxy is i have to assume [21:50:27] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: Need a way to create a systemd service that is initially stopped - https://phabricator.wikimedia.org/T105749#1450762 (10Smalyshev) 3NEW a:3Joe [21:50:39] HSTS is something you do at the very end for production things. The consequences while experimenting can be very ugly if it gets output with the wrong domainname or options. [21:50:58] yes, it already happened [21:51:06] because it was set to use the labs proxy [21:51:19] we did not change apache config of the mailman service [21:51:28] well at least you're restricted to messing with wmflabs.org [21:51:49] but still, there's potential to affect other services on that proxy or whatever. I don't know the details of that setup, just saying. [21:52:10] since it's just a header, it's easy to set without much testing at a much later stage. 
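Editorial sketch related to the HSTS advice above: since HSTS is a single response header, one cautious approach while experimenting is to emit it with a very short max-age (or max-age=0, which per RFC 6797 tells a browser to stop treating the host as an HSTS host). The helper below is hypothetical and only builds the header; it is not how the labs proxy sets it.

```python
from typing import Tuple


def hsts_header(max_age_seconds: int, include_subdomains: bool = False) -> Tuple[str, str]:
    """Build a Strict-Transport-Security header as a (name, value) pair.

    Hypothetical helper for illustration; pass max_age_seconds=0 to ask
    browsers to drop a previously remembered HSTS policy for the host.
    """
    if max_age_seconds < 0:
        raise ValueError("max-age must be non-negative")
    value = f"max-age={max_age_seconds}"
    if include_subdomains:
        # Only add this once every subdomain is known to serve HTTPS.
        value += "; includeSubDomains"
    return ("Strict-Transport-Security", value)
```

For example, `hsts_header(300)` gives a five-minute policy for testing, and a long max-age can be switched on at the end once the hostname and certificate are settled.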
[21:52:34] (03CR) 1020after4: [C: 031] Fix /static 404s in beta [puppet] - 10https://gerrit.wikimedia.org/r/224511 (https://phabricator.wikimedia.org/T105541) (owner: 10Thcipriani) [21:53:36] JohnFLewis, yes, sorry, I do get a warning [21:54:22] JohnFLewis, I can ignore it though, which I did for this case [21:54:29] !log Debugging metric issue on graphite1001, brief stats drop possible [21:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:55:07] deleting it from chrome, I was able to click through it now [21:55:19] removed the hsts header as well (bblack) [21:55:51] ori: I see a non-deployed Revert "Revert "Revert Count API module instantiations and Hook runs"" commit that's before the one I want to deploy, is it ok to sync out? [21:56:13] legoktm: can you hang on for a couple of minutes? [21:56:16] sure [21:56:22] thanks [22:00:09] !log es1.6 step 4: upgrade elastic1003 [22:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:02:42] 6operations, 10RESTBase, 7Monitoring, 5Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1450817 (10GWicke) We discussed this on IRC, but didn't mention it here yet: The histograms reported by the graphite reporter... [22:02:47] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.67% of data above the critical threshold [1000.0] [22:04:14] Krenair: fixed [22:04:23] needed to do some apache changes because 2.4 :) [22:05:07] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [22:06:34] legoktm: almost done, sorry [22:09:51] ok :) [22:10:50] andrewbogott, bblack, YuviPanda|zzz: thanks for the input! got it mostly working now barring apache version changes though we can change that right now I don't think. 
[22:10:58] PROBLEM - statsdlb process on graphite2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name statsdlb [22:11:05] mutante: all looks good now, a puppet run seems to get things working :) [22:11:15] robh: ^ [22:11:48] PROBLEM - puppet last run on graphite1001 is CRITICAL Puppet has 1 failures [22:11:57] (03PS3) 10John F. Lewis: mail: hiera-ise mailman and lists [puppet] - 10https://gerrit.wikimedia.org/r/224210 [22:13:24] legoktm: go ahead, sorry [22:13:46] ori: ok, should I sync your change? [22:13:46] JohnFLewis: this is labs only or applied to sodium as well? [22:13:52] legoktm: please [22:13:56] robh: labs only [22:14:08] I've not added heira values like this before so im not comofrtable with this being a +2 for production is all =] [22:14:21] cool, so like other changes, planned for jessie deployment of mailman [22:14:27] that I'll gladly +2 =] [22:15:00] !log legoktm Synchronized php-1.26wmf13/includes/Hooks.php: Revert "Revert "Revert Count API module instantiations and Hook runs"" (duration: 00m 13s) [22:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:15:05] robh: honestly, I'm waiting for you guys to come to me now and go 'so then, let's start this thing shall we?' 
[22:15:18] labs being working was more or less my final thing to do [22:15:28] !log legoktm Synchronized php-1.26wmf13/includes/api/ApiMain.php: Revert "Revert "Revert Count API module instantiations and Hook runs"" (duration: 00m 12s) [22:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:15:32] ori: ^ [22:15:41] thanks [22:15:56] Well, I expect either this or next week we'll get some kind of VM up and running and do some test imports yep, the progress from your labs work being the major reason we could do that =] [22:16:13] !log legoktm Synchronized php-1.26wmf13/includes/User.php: Add 'AuthPluginStrict' log to identify users who are unable to authenticate (duration: 00m 13s) [22:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:16:26] though I know we won't be able to do any actual major migration for at least two weeks prior to wikimania, as we have to have either mark or faidon do the final reviews [22:16:33] and faidon is @ wikimania [22:16:45] robh: slot the VM request before an ops meeting with the plan of doing stuff [22:16:58] oh, daniel tried to push that today already =] [22:16:58] so my shell can be approved in a meeting so I'll be fully there for it [22:17:10] it was politely pushed as we ran out of time =P [22:17:27] but the vm request doesnt need ops meeting, you are correct the sudo does [22:17:51] (03CR) 10RobH: [C: 032] "these arent applied to" [puppet] - 10https://gerrit.wikimedia.org/r/224210 (owner: 10John F. Lewis) [22:17:53] but the sudo is extremely helpful for one of us to help here :) [22:18:38] RECOVERY - statsdlb process on graphite2001 is OK: PROCS OK: 1 process with command name statsdlb [22:18:54] agreed 100% [22:19:12] two weeks prior to wikimania is already here though [22:19:20] ? [22:19:26] mutante: how so, wikimania is this week. 
[22:19:29] oh, i think i just got it wrong [22:19:42] im saying faidon is at wikimania, and such should be focused on wikimania [22:20:05] so i dont expect us to be able to, even if we got it working 100% today, migrate off sodium for at least 2 weeks AFTER the conference [22:20:24] as we'll need either faidon or mark to have time to fully dedicate a day or two in review [22:20:37] PROBLEM - statsite backend instances on graphite2001 is CRITICAL Not all configured statsite instances are running. [22:20:40] or perhaps only an hour, dunno. [22:20:50] (03CR) 10Legoktm: [C: 032] Set $wgCentralAuthStrict = true; [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224510 (https://phabricator.wikimedia.org/T37707) (owner: 10Legoktm) [22:20:54] i just know if you go to wikimania, you come home with lots of projects =] [22:20:56] (03Merged) 10jenkins-bot: Set $wgCentralAuthStrict = true; [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224510 (https://phabricator.wikimedia.org/T37707) (owner: 10Legoktm) [22:21:01] (that you want to work on ;) [22:21:35] robh: in that case, let's hold 'Wikimania Lucid' ;) [22:22:51] hrmm... lets see what brion's server is running.. (my bouncer) [22:23:04] huh, 12.04 [22:23:21] ask him to downgrade to Lucid [22:23:27] that isn't as outdated as I expected [22:23:36] considering that both brion and I could upgrade it, and yet we dont. [22:23:46] no one is paying us to work on it. [22:24:27] JohnFLewis: do not adopt the above policy. [22:24:32] ;D [22:24:50] ori: sync-common is taking a long time on mw1017 (I'm testing my config change before it goes out)...is anything different about mw1017 right now w/r to scap? [22:25:13] ori: ...and it completed as I said that. 
I'm just being impatient :P [22:25:37] i see a bunch of phab tasks for it and testwiki [22:25:38] legoktm: sync-common --verbose is your friend when doing things like that :) [22:25:44] but they are to migrate testwiki to it, so dunno [22:26:21] we just should take a mw system and rename it testwikiXXXX =P [22:26:31] (im only partially sarcastic about that) [22:26:33] if it's supposed to be a VM we still need a request for it [22:26:34] robh: ganeti :p [22:26:50] heh [22:27:02] testwiki100[1-5] in ganeti, then we have testwiki redundancy ;) [22:27:14] the first unplanned ganeti host outage will be interesting. [22:27:29] even more so if lists will be on it [22:27:30] 2015-07-13 22:27:10 mw1017 testwiki AuthPluginStrict INFO: Authentication denied for Lego-test {"private":false} [22:27:32] YAY [22:28:08] RECOVERY - statsite backend instances on graphite2001 is OK All defined statsite jobs are runnning. [22:28:14] well, IF, it's supposed to be a VM [22:28:15] legoktm: they said it couldn't be done :) [22:28:29] hmm, someone else is scaping? I can't get the lock [22:29:06] ori: ? [22:29:21] mutante: the talk is it is a VM and we have to work that way as it's what mark says is the plan [22:31:02] -rw-rw-r-- 1 ori wikidev 0 Jul 13 22:28 /var/lock/scap [22:32:13] legoktm: pid 1930 -- running for 22 minutes [22:32:25] :/ [22:32:32] * legoktm waits [22:32:57] RECOVERY - puppet last run on graphite1001 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [22:35:09] errr... not for 22 minutes, since 22:21 [22:35:22] robh: aren't we also supposed to be creating a tracking/detailed/summarisy ticket for the migration project? [22:35:47] he must have the env flag set to suppress the announce messages? [22:35:47] saw it a few times but nothing has been done from what I can see? 
:) [22:35:55] i have a draft in a text doc right now actually [22:36:03] cool [22:36:05] JohnFLewis: i started doing it this morning during the meeting [22:36:10] lemme finish it up and push to task [22:36:17] okay :) [22:36:21] Anything known to be broken? [22:36:31] except of switf [22:36:35] * swift, even [22:37:46] define broken? [22:38:09] Well, heavily degraded, whatever [22:38:20] edit rates just imploded for Wikidata [22:38:22] and I wonder why [22:38:27] but I only see swift error [22:38:28] s [22:38:47] there were some graphite issues, but I think they've been fixed now [22:39:23] legoktm: blargh, sorry [22:39:30] Maybe people just stopped editing [22:39:35] some bot or so [22:40:14] maybe they're all flying to mexico? ;) [22:40:20] legoktm: all yours [22:40:33] legoktm: Possible [22:40:43] ori: thanks [22:40:59] !log legoktm Synchronized wmf-config/InitialiseSettings.php: Set $wgCentralAuthStrict = true; (duration: 00m 12s) [22:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:41:25] !log legoktm Synchronized wmf-config/CommonSettings.php: Set $wgCentralAuthStrict = true; (duration: 00m 13s) [22:41:28] someone should give swift/ media some love [22:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:41:48] woohoo [22:41:50] {{done}} [22:41:57] PROBLEM - puppet last run on sodium is CRITICAL puppet fail [22:42:32] JohnFLewis: robh: that didnt work [22:42:39] Could not find data item mailman::lists_ipv4 in any Hiera data file [22:42:42] mutante: which part? [22:42:48] gah? [22:42:55] * JohnFLewis looks [22:42:55] the hiera-ization [22:43:40] i thought these werent applied live in production yet? 
(even if merged) [22:43:51] (03PS2) 10Ori.livneh: Don't assume current l10n cache files are .cdb [tools/scap] - 10https://gerrit.wikimedia.org/r/224520 [22:43:51] they are [22:43:59] bd808: ^ [22:44:03] bleh [22:44:19] (03PS1) 10Legoktm: Revert "Set $wgCentralAuthStrict = true;" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224529 [22:44:25] (03CR) 10Legoktm: [C: 032] Revert "Set $wgCentralAuthStrict = true;" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224529 (owner: 10Legoktm) [22:44:31] (03Merged) 10jenkins-bot: Revert "Set $wgCentralAuthStrict = true;" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224529 (owner: 10Legoktm) [22:44:43] robh: you said you weren't going to merge then did :) I was thinking 'should I say something or did he change his mind;' [22:45:00] i thought you said it didnt touch production, so yea i merged... [22:45:03] i'll go ahead and revert. [22:45:12] (03PS1) 10RobH: Revert "mail: hiera-ise mailman and lists" [puppet] - 10https://gerrit.wikimedia.org/r/224530 [22:45:20] robh: or we could fix it? 
:) [22:45:22] !log legoktm Synchronized wmf-config: Revert "Set $wgCentralAuthStrict = true;" (duration: 00m 13s) [22:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:45:57] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find data item mailman::lists_ipv4 in any Hiera data file and no default supplied at /etc/puppet/manifests/role/mail.pp:84 on node sodium.wikimedia.org [22:46:04] so i think the issue is that: [22:46:13] you are trying to use a role in hiera [22:46:16] to set the variable [22:46:26] but the special "role" keyword is not used [22:46:42] on the node sodium to include it [22:46:58] (03CR) 10BryanDavis: Don't assume current l10n cache files are .cdb (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/224520 (owner: 10Ori.livneh) [22:47:14] JohnFLewis: ^ [22:47:47] https://wikitech.wikimedia.org/wiki/Puppet_Hiera#Role-based_lookup [22:47:49] hm [22:48:06] "This system, which is basically abusing puppet internals, comes with its fair share of limitations," [22:49:41] (03CR) 10Ori.livneh: Don't assume current l10n cache files are .cdb (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/224520 (owner: 10Ori.livneh) [22:50:32] (03PS3) 10Ori.livneh: Don't assume current l10n cache files are .cdb [tools/scap] - 10https://gerrit.wikimedia.org/r/224520 [22:50:33] JohnFLewis: instead of role/common/mailman.yaml we could use /hosts/sodium.yaml and for labs the special hiera page [22:50:34] robh: fix coming in a few seconds [22:51:30] mutante: yeah might be best right now [22:52:19] mutante: actually no, real fix here [22:52:53] (03PS1) 10John F. 
Lewis: mail: fix location of mailman hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/224531 [22:53:04] robh: ^ johnflewis@ubuntu:~/puppet/hieradata/role$ /home/johnflewis/puppet/utils/hiera_lookup --fqdn=sodium.wikimedia.org --site=eqiad --roles=mail::lists mailman::lists_ipv4 [22:53:04] 208.80.154.4 [22:53:53] ok... i dont get theis fix. [22:53:56] all three variables work too. I just messed up the location of the hiera thing [22:54:00] ? [22:54:05] just syntax error in config? [22:54:46] oh right [22:55:02] so i take the whole thing about role-based lookup back, because it already DOES use the special role keyword [22:55:03] robh: sodium includes role mail::lists -- so hiera on sodium looks in role/common/mail/lists.yaml for the variables [22:55:07] but the name of the role was wrong [22:55:10] ahhhh [22:55:10] mutante: yes [22:55:30] (03Abandoned) 10RobH: Revert "mail: hiera-ise mailman and lists" [puppet] - 10https://gerrit.wikimedia.org/r/224530 (owner: 10RobH) [22:55:31] since the variables are called from mail::lists as well (as a role) [22:56:15] (03CR) 10John F. Lewis: "https://gerrit.wikimedia.org/r/#/c/224531/" [puppet] - 10https://gerrit.wikimedia.org/r/224530 (owner: 10RobH) [22:56:30] (so we know where the fix is anyway) [22:56:46] (03CR) 10RobH: [C: 032] mail: fix location of mailman hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/224531 (owner: 10John F. Lewis) [22:56:52] well, lets see if that fixes it, merging into use [22:57:39] usage: lldpctl [options] =P (tired of these) [22:58:00] (bad lldpctl output in every puppet run) [22:58:30] I'd like to get https://gerrit.wikimedia.org/r/224533 in as a last-minute SWAT item, if someone on my team is available to review it. 
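Note on the Hiera fix above: as JohnFLewis explains, sodium includes the role mail::lists, so the role-based lookup reads role/common/mail/lists.yaml, while the values had been placed in role/common/mailman.yaml. A small Python sketch of that mapping follows. The function name `role_hiera_path` is hypothetical, and the rule (replace `::` with `/` under `role/common/`) is inferred from the two paths quoted in the conversation rather than from the actual Hiera backend code.

```python
def role_hiera_path(role):
    """Map a Puppet role name to the role-based Hiera file it is looked up in.

    Assumed rule, inferred from the log: 'mail::lists' is read from
    'role/common/mail/lists.yaml'.
    """
    parts = role.split("::")
    if not all(parts):
        raise ValueError(f"malformed role name: {role!r}")
    return "role/common/" + "/".join(parts) + ".yaml"


# The node includes role mail::lists, so this is the file that is consulted:
LOOKED_UP = role_hiera_path("mail::lists")

# A file at this path would only be read for a role literally named 'mailman':
MISPLACED = role_hiera_path("mailman")
```

The file is chosen by the role name, not by the key prefix, which is why mailman::lists_ipv4 could not be found until the data lived under mail/lists.yaml.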
[22:58:35] JohnFLewis: worked [22:58:41] ok, sodium now runs puppet ok again [22:58:49] :D [22:59:07] RECOVERY - puppet last run on sodium is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [22:59:11] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) - https://phabricator.wikimedia.org/T105756#1450894 (10RobH) 3NEW [22:59:51] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) - https://phabricator.wikimedia.org/T105756#1450905 (10RobH) [22:59:59] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) - https://phabricator.wikimedia.org/T105756#1450894 (10RobH) [23:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150713T2300). Please do the needful. [23:00:05] James_F: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:05] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) - https://phabricator.wikimedia.org/T105756#1450894 (10RobH) [23:00:11] Yup. 
[23:00:16] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) - https://phabricator.wikimedia.org/T105756#1450894 (10RobH) [23:00:24] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) - https://phabricator.wikimedia.org/T105756#1450894 (10RobH) [23:00:26] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Ferm rules for mailman - https://phabricator.wikimedia.org/T104980#1450912 (10RobH) [23:00:27] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [23:00:28] I'll do it [23:01:12] (03PS3) 10Dzahn: Add a note about RCStream to irc.wikimedia.org MOTD [puppet] - 10https://gerrit.wikimedia.org/r/224242 (https://phabricator.wikimedia.org/T87780) (owner: 10Glaisher) [23:01:32] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) - https://phabricator.wikimedia.org/T105756#1450894 (10RobH) [23:01:57] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) - https://phabricator.wikimedia.org/T105756#1450917 (10JohnLewis) This would upgrade to mailman 2.1.18 as this is the version packaged with Debian Jessie per default. [23:02:49] robh: looks good for a task [23:03:05] I dont think the entire migrate import start over is paranoid... [23:03:10] since we cannt do a full migration in labs. [23:03:16] but others may overrule me. [23:03:35] i just dislie the idea of doing a single live import and then rushing through the 'did anything break' due to a limited window [23:03:40] dislike even. [23:04:18] it's an interesting approach but would mean we can say 'we've done it once. It went with no issues except x but we did y to solve it at the time and we can do z to prevent it which means we prevent delays in the process' :) [23:05:21] so if it's a VM and we want to test it fully before going live [23:05:40] maybe the answer is a ganeti instance.. 
do the import, test [23:05:42] seems even easier than bare metal ;D [23:05:52] but then use a new one for actual prod [23:05:58] well, we want the test to have the same routing capabiltiy and test it [23:06:00] but yea [23:06:16] would simply be two vms. [23:06:20] VMs. [23:06:26] or we could request 2 VMs, prod and staging [23:06:29] yea [23:08:17] I'm up for whatever :) [23:10:08] i like that, labs->staging->prod.. where staging can have private stuff [23:10:09] robh: mutante a thing before we consider a VM anyway is we need to see what specs it needs with RAM and CPUs/VCPUs/(whatever they're called now) [23:11:00] !log catrope Synchronized php-1.26wmf13/extensions/Flow/includes/Parsoid/Utils.php: Add title to Parsoid exception logging (duration: 00m 12s) [23:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:11:36] anyway while I remember robh: https://phabricator.wikimedia.org/T105229 -- that would have something to do with a contact def anyway won't it? 
[23:12:09] JohnFLewis: yes, it would have a blocker "add John as contact for service X" [23:12:18] well, i think the basic answer is no one outside of ops gets it for all services [23:12:30] and at best, we'll grant it to teams and individuals on a per service group baiss [23:12:34] was the impression i got from the meeting [23:12:34] robh: I know that part but the delegation part of it :) [23:12:51] (03PS1) 10JanZerebecki: Increase wikidata dispatch lag critical to >300s [puppet] - 10https://gerrit.wikimedia.org/r/224541 [23:12:57] so your request wasnt even addressed, sicen we denied the services team until we get it more seeded [23:13:12] as your access would be a bit more restricted than services (who get all of restbase) [23:13:20] yeah :) [23:13:57] the delgation on who updates icinga to accept those varying service groups isnt clear =P [23:14:13] I'm not being evasive, no one volunteered for it in the meeting =P [23:14:25] :P [23:14:28] imho the access should be for the services that a person also has shell for [23:14:31] granted, a thrid of the team wasnt around [23:14:40] mutante: shell and admin abiltiy [23:14:54] so if you are in an admin group that is for service X you should be an Icinga contact for X and get notified [23:15:00] and if you get notified you should be able to ACK it [23:15:04] yep, admin group, agreed. yep! [23:15:08] we're on the same page =] [23:15:35] so an admin group is a requirement now for getting alerts from icinga? or mis read? 
;) [23:15:50] i think getting alerts should be the price for getting sudo rules [23:15:54] nope, but you cannot ack alerts if you arent in admin group for a service [23:16:03] robh: mkay [23:16:15] i think that makes sense [23:16:22] if you are doing the work, you should ack it in icinga [23:16:22] yeah [23:16:36] but yea, folks get pages now who have no ability to fix them [23:16:47] just because they need to know if there is an issue [23:17:05] so yea, being paged does not and should not equate out to the ability to ack them [23:17:07] (03CR) 10Hoo man: [C: 031] "Regex looks good, change in behavior makes sense." [puppet] - 10https://gerrit.wikimedia.org/r/224541 (owner: 10JanZerebecki) [23:17:23] but if you admin a service (usergroup), you should likely be paged by icinga, and you should be able to ack said pages, as the admin of service group [23:17:37] not as recipient of a page. [23:17:42] robh: I was thinking we could kill a bird/blocker of that and add me to icinga for sodium if that doesn't bring any issues up? [23:18:19] not sure how its a blocker. [23:18:20] 6operations, 10RESTBase, 7Monitoring, 5Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1450951 (10Eevans) >>! In T78514#1450817, @GWicke wrote: > We discussed this on IRC, but didn't mention it here yet: The hist... [23:18:37] because the whole fine-grained permission thing would be based on contacts [23:18:42] ^ [23:18:47] so if you are a contact for it you get rights [23:18:58] but until we make such a change it just means he gets email [23:19:08] I think we have to fine grain icinga and give it to services before we start handing it out to others... [23:19:28] but i dont have any particular feeling against it [23:20:02] particularly strong feelings that is ;D [23:20:43] mutante: yea but isnt that not reallyt he blocker [23:20:53] isnt the blocker the service level definitions for control? 
[23:21:41] * JohnFLewis steps out to look at new-mailman specs [23:22:00] it's both, without either of it he won't be able to ACK [23:22:44] !log catrope Synchronized php-1.26wmf13/extensions/VisualEditor: SWAT (duration: 00m 11s) [23:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:23:38] Thanks, RoanKattouw [23:28:51] (03PS2) 10BBlack: Fix /static 404s in beta [puppet] - 10https://gerrit.wikimedia.org/r/224511 (https://phabricator.wikimedia.org/T105541) (owner: 10Thcipriani) [23:30:02] (03CR) 10BBlack: [C: 032] Fix /static 404s in beta [puppet] - 10https://gerrit.wikimedia.org/r/224511 (https://phabricator.wikimedia.org/T105541) (owner: 10Thcipriani) [23:32:06] (03PS2) 10Dzahn: add missing mod_rewrite rules [software/tendril] - 10https://gerrit.wikimedia.org/r/224378 (https://phabricator.wikimedia.org/T98816) (owner: 10Springle) [23:33:21] 6operations, 10SEO: GWT accounts - https://phabricator.wikimedia.org/T103567#1450995 (10Wwes) @dr0ptp4kt if @stu and @ori have been updated like myself I believe you can close this one out now. [23:34:46] (03PS3) 10BBlack: dynamicproxy: Allow proxies to be https only [puppet] - 10https://gerrit.wikimedia.org/r/218117 (https://phabricator.wikimedia.org/T102367) (owner: 10Yuvipanda) [23:36:29] (03CR) 10BBlack: [C: 032] "PS3 was just a manual rebase" [puppet] - 10https://gerrit.wikimedia.org/r/218117 (https://phabricator.wikimedia.org/T102367) (owner: 10Yuvipanda) [23:44:39] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [23:45:36] (03PS1) 10Dzahn: tendril: add missing rewrite rules [puppet] - 10https://gerrit.wikimedia.org/r/224542 (https://phabricator.wikimedia.org/T98816) [23:48:31] (03CR) 10Dzahn: "what Ori said. and yea, it should go into the main Apache config rather than a .htaccess file." 
[software/tendril] - 10https://gerrit.wikimedia.org/r/224378 (https://phabricator.wikimedia.org/T98816) (owner: 10Springle) [23:50:12] (03PS2) 10Dzahn: tendril: add missing rewrite rules [puppet] - 10https://gerrit.wikimedia.org/r/224542 (https://phabricator.wikimedia.org/T98816) [23:51:59] (03CR) 10Dzahn: [C: 032] "same rules already applied by .htaccess , Options FollowSymLinks is already in main config and also default" [puppet] - 10https://gerrit.wikimedia.org/r/224542 (https://phabricator.wikimedia.org/T98816) (owner: 10Dzahn) [23:53:31] (03CR) 10Dzahn: [C: 04-1] "already added here now: https://gerrit.wikimedia.org/r/#/c/224542/ and Options FollowSymLinks was already in config and is default too" [software/tendril] - 10https://gerrit.wikimedia.org/r/224378 (https://phabricator.wikimedia.org/T98816) (owner: 10Springle) [23:55:35] (03Abandoned) 10Springle: add missing mod_rewrite rules [software/tendril] - 10https://gerrit.wikimedia.org/r/224378 (https://phabricator.wikimedia.org/T98816) (owner: 10Springle) [23:58:25] (03PS6) 10Dzahn: tendril: add config template [puppet] - 10https://gerrit.wikimedia.org/r/224205 (https://phabricator.wikimedia.org/T98816)