[00:04:38] !log catrope@tin Synchronized php-1.27.0-wmf.1/extensions/Echo: SWAT (duration: 00m 17s) [00:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:04:56] !log catrope@tin Synchronized php-1.27.0-wmf.2/extensions/Echo: SWAT (duration: 00m 17s) [00:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:06:57] (03CR) 10MaxSem: [C: 031] "Discovery is fine with this." [dns] - 10https://gerrit.wikimedia.org/r/244078 (owner: 10Dzahn) [00:09:39] (03PS1) 10Dzahn: logstash: access to port 9200 for krypton [puppet] - 10https://gerrit.wikimedia.org/r/244095 (https://phabricator.wikimedia.org/T114836) [00:21:27] (03CR) 10Dzahn: "ran in compiler on some hosts using mariadb classes http://puppet-compiler.wmflabs.org/958/" [puppet] - 10https://gerrit.wikimedia.org/r/243852 (owner: 10Dzahn) [00:21:35] (03PS2) 10Dzahn: lint: double quoted strings pt.1 [puppet] - 10https://gerrit.wikimedia.org/r/243852 [00:28:13] (03CR) 10GWicke: [C: 031] logstash: access to port 9200 for krypton [puppet] - 10https://gerrit.wikimedia.org/r/244095 (https://phabricator.wikimedia.org/T114836) (owner: 10Dzahn) [00:29:14] (03Abandoned) 10Dzahn: put base::firewall on netmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/194802 (https://phabricator.wikimedia.org/T104939) (owner: 10Dzahn) [00:30:21] (03PS2) 10Dzahn: cdh: lint fixes - indentation [puppet/cdh] - 10https://gerrit.wikimedia.org/r/242031 [00:31:17] (03PS3) 10Dzahn: hadoop/analytics: lint fixes - indentation [puppet/cdh] - 10https://gerrit.wikimedia.org/r/242031 [00:33:04] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:40:16] (03CR) 10Dzahn: [C: 04-1] "http://puppet-compiler.wmflabs.org/959/analytics1027.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/241318 (owner: 10Dzahn) [00:45:00] (03CR) 10Dzahn: "where does the error even come from? 
Syntax error at '<<'; expected '}' at /mnt/jenkins-workspace/puppet-compiler/959/change/src/manifests" [puppet] - 10https://gerrit.wikimedia.org/r/241318 (owner: 10Dzahn) [01:41:56] (03CR) 10Alex Monk: Add all groups to general bastions, mostly empty bastiononly group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227327 (https://phabricator.wikimedia.org/T114161) (owner: 10Alex Monk) [01:44:16] !log catrope@tin Synchronized php-1.27.0-wmf.1/extensions/Echo: Fix JS error (duration: 00m 17s) [01:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:44:35] !log catrope@tin Synchronized php-1.27.0-wmf.2/extensions/Echo: Fix JS error (duration: 00m 18s) [01:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:47:12] (03CR) 10Alex Monk: "What if we:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/243357 (owner: 10Alex Monk) [01:47:45] (03PS1) 10Dzahn: deactivate wikimemory.org [dns] - 10https://gerrit.wikimedia.org/r/244101 [01:51:06] Krenair: did you know wikijunior.org [01:51:26] I might have read it in some dns file before [01:51:32] but I don't know anything about it [01:51:49] well, it redirects you to a project of en.wikibooks.org [01:51:58] the question will be at some point.. is the domain name worth it [01:53:01] as you might have seen i am suggesting to deactivate a bunch of domains.. some are more obvious than other ones [01:54:04] "Wikijunior is a part of the Wikimedia Foundation and Wikibooks and is subject to their rules and policies. Wikijunior also has its own set of guidelines in addition to those. Sometimes Wikijunior policy differs from policy on the rest of Wikibooks —" [01:54:22] uhm, yea.. that's like almost a separate project but not 100% :p [01:56:36] wikimedia.us - should be an overview page that links to all US chapters.. 
or deactivate [01:58:36] (03PS1) 10Dzahn: deactivate wikimania.asia [dns] - 10https://gerrit.wikimedia.org/r/244103 [01:59:51] (03PS1) 10Dzahn: deactivate wikiknihy.cz [dns] - 10https://gerrit.wikimedia.org/r/244104 [02:01:24] (03CR) 10Dzahn: "@andre__ you are from .cz , what is this for?" [dns] - 10https://gerrit.wikimedia.org/r/244104 (owner: 10Dzahn) [02:10:31] Gerrit is reaaaaaaaaaaally slow today [02:10:39] (command line [02:11:29] Krinkle: I blame Reedy. It's always his fault when gerrit is slow. :-) [02:12:53] mutante: I think an overview would be good, but someone would have to build it. [02:14:02] James_F: with content included from wiki? [02:14:22] to revert the move of the www.wikipedia portal :p [02:14:22] mutante: I think we could do it statically; it's incredibly rare it changes. [02:14:44] mutante: Personally I'd just redirect to wikimediadc.org. :-) [02:15:23] James_F: nowadays the question is always if we want to pay for the cert issue because we want to be https-only and so even the redirects are not free anymore [02:15:33] Oh, yeah. [02:15:40] Well, redirect to someone else's server. [02:15:49] But yeah. [02:15:55] We need a *.* cert. [02:16:00] Aka, CA status. [02:16:15] yea. and own the entire .wiki :p [02:16:19] * mutante hides [02:16:26] James_F: hey, there's a business model! [02:16:39] * James_F grins. [02:16:44] good point, instead of banners? [02:16:49] yep! sell certs! [02:16:55] and .wiki domains we dont need [02:17:06] to the highest bidder :p [02:17:07] * greg-g calls up the board's red phone [02:17:15] "Hello, I've solved our problem." [02:17:31] :p [02:17:38] * greg-g goes to find foo [02:17:39] d [02:18:04] but letsencrypt comes to the rescue soon.. hmm..right [02:21:42] ok, see you later. 
afk [02:31:40] (03CR) 10Krinkle: Set page purge limiting (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243363 (owner: 10Aaron Schulz) [02:32:44] !log l10nupdate@tin Synchronized php-1.27.0-wmf.1/cache/l10n: l10nupdate for 1.27.0-wmf.1 (duration: 08m 22s) [02:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:35:43] (03PS2) 10Aaron Schulz: Set page purge limiting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243363 [02:37:32] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.1) at 2015-10-07 02:37:32+00:00 [02:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:46:09] (03PS3) 10Aaron Schulz: Set page purge limiting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243363 [03:03:46] !log l10nupdate@tin Synchronized php-1.27.0-wmf.2/cache/l10n: l10nupdate for 1.27.0-wmf.2 (duration: 10m 18s) [03:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:08:12] (03CR) 10Krinkle: [C: 031] Set page purge limiting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243363 (owner: 10Aaron Schulz) [03:08:41] (03PS2) 10EBernhardson: Revert "Revert "Update CirrusSearch config for testwiki to talk to second cluster"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244077 [03:10:02] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.2) at 2015-10-07 03:10:02+00:00 [03:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:19:07] (03PS3) 10EBernhardson: Revert "Revert "Update CirrusSearch config for testwiki to talk to second cluster"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244077 [03:21:19] (03PS4) 10EBernhardson: Revert "Revert "Update CirrusSearch config for testwiki to talk to second cluster"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244077 [03:24:05] although testwiki requests always run on mw1017, i'm betting job's issued by 
testwiki are still always run on mw10{01..16} ? [03:29:13] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 62383 bytes in 0.116 second response time [03:34:19] (03CR) 10EBernhardson: [C: 032] Revert "Revert "Update CirrusSearch config for testwiki to talk to second cluster"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244077 (owner: 10EBernhardson) [03:34:26] (03Merged) 10jenkins-bot: Revert "Revert "Update CirrusSearch config for testwiki to talk to second cluster"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244077 (owner: 10EBernhardson) [03:35:17] !log ebernhardson@tin Synchronized wmf-config/: Reenable second ES cluster on testwiki only (duration: 00m 18s) [03:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:36:30] (03PS1) 10EBernhardson: Revert "Revert "Revert "Update CirrusSearch config for testwiki to talk to second cluster""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244110 [03:36:36] (03CR) 10EBernhardson: [C: 032] Revert "Revert "Revert "Update CirrusSearch config for testwiki to talk to second cluster""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244110 (owner: 10EBernhardson) [03:36:42] (03Merged) 10jenkins-bot: Revert "Revert "Revert "Update CirrusSearch config for testwiki to talk to second cluster""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244110 (owner: 10EBernhardson) [03:37:12] !log ebernhardson@tin Synchronized wmf-config: redisable second cluster only on testwiki (duration: 00m 16s) [03:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:07:38] (03PS1) 10EBernhardson: Revert "Revert "Revert "Revert "Update CirrusSearch config for testwiki to talk to second cluster"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244113 [04:15:48] !log sync-common on mw2187.codfw.wmnet to fix localisation cache errors in exception.log [04:15:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:16:02] RECOVERY - HHVM rendering on mw2187 is OK: HTTP OK: HTTP/1.1 200 OK - 64383 bytes in 2.923 second response time [04:16:18] :) [04:17:04] RECOVERY - Apache HTTP on mw2187 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.968 second response time [04:22:24] (03CR) 10EBernhardson: [C: 032] Revert "Revert "Revert "Revert "Update CirrusSearch config for testwiki to talk to second cluster"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244113 (owner: 10EBernhardson) [04:22:30] (03Merged) 10jenkins-bot: Revert "Revert "Revert "Revert "Update CirrusSearch config for testwiki to talk to second cluster"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244113 (owner: 10EBernhardson) [04:23:14] ebernhardson, infinite recursion detected [04:23:19] MaxSem: indeed [04:23:45] !log ebernhardson@tin Synchronized wmf-config/: enable second es cluster in testwiki one more time (duration: 00m 18s) [04:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:24:05] * ebernhardson sighs [04:24:09] this doesn't make sense [04:24:28] the only possibility i could find was that the caching wasn't working right, but it seems fine [04:29:41] ok, verified it is a caching issue :S [04:33:30] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Touch InitialiseSettings.php to force config regeneration (duration: 00m 18s) [04:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:34:04] MaxSem: the whole damn problem was simply that InitialiseSettings.php had an old timestamp :( [04:34:21] poop [04:36:53] on the upside, writes look to be going to both clusters :) [04:39:47] 6operations, 5Patch-For-Review: setup / deploy nobelium for elastic-search testing in labs - https://phabricator.wikimedia.org/T113282#1707789 (10EBernhardson) Deployed a patch to mediawiki-config, testwiki is now writing to the 
labsearch cluster. Will enable more wikis tomorrow [04:48:59] 6operations, 10OTRS, 6Security, 5Patch-For-Review: Make OTRS sessions IP-address-agnostic - https://phabricator.wikimedia.org/T87217#1707801 (10Keegan) [04:57:50] 6operations, 10Parsoid: Investigate Oct 3 outage of the Parsoid cluster due to high cpu usage + high memory usage (sharp spike in both) around 08:35 UTC - https://phabricator.wikimedia.org/T114558#1707814 (10GWicke) It looks like there were some small spikes in html2wt requests: {F2663368} [04:58:57] (03PS1) 10Dzahn: add template for 'mailonly' domains [dns] - 10https://gerrit.wikimedia.org/r/244115 [05:03:42] (03CR) 10Dzahn: "i hope this can go on a puppet swat now that it's down to just mediawiki" [puppet] - 10https://gerrit.wikimedia.org/r/225552 (owner: 10Gergő Tisza) [05:04:01] 6operations, 5Patch-For-Review: setup / deploy nobelium for elastic-search testing in labs - https://phabricator.wikimedia.org/T113282#1707819 (10EBernhardson) Fun things I forgot to deal with: Different clusters need different numbers of shards / replicas [05:07:39] (03PS1) 10Dzahn: Revert "Revert "admin: Allow aklapper to reset user auths and delete accounts in Phab"" [puppet] - 10https://gerrit.wikimedia.org/r/244116 [05:08:23] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Expand shell access for aklapper on Phabricator - https://phabricator.wikimedia.org/T113124#1707825 (10Dzahn) @chasemp what was the potential problem that @yuvipanda mentioned. @aklapper do you have time now? how can i help? all: time for revert revert... 
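Editor's note on the 04:34 diagnosis (InitialiseSettings.php had an old timestamp, so the cached config was not regenerated until the file was touched): the following is a minimal Python sketch of a timestamp-based staleness check, to illustrate why an old mtime on the source keeps a cache from being rebuilt. It is not MediaWiki's actual implementation; `is_cache_stale`, `touch`, `demo` and the file names are hypothetical.

```python
import os
import tempfile
import time


def is_cache_stale(source_path, cache_path):
    """Return True when the cache is missing or older than its source file."""
    if not os.path.exists(cache_path):
        return True
    return os.path.getmtime(source_path) > os.path.getmtime(cache_path)


def touch(path, when=None):
    """Create the file if needed and set its mtime (default: now)."""
    with open(path, "a"):
        pass
    times = None if when is None else (when, when)
    os.utime(path, times)


def demo():
    """Compare staleness before and after touching an old-dated source file."""
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "settings.php")   # hypothetical source file
        cache = os.path.join(tmp, "settings.cache")  # hypothetical cache file
        now = time.time()
        touch(cache, now)
        touch(source, now - 3600)  # source arrives with an old timestamp
        before = is_cache_stale(source, cache)
        touch(source, now + 1)     # bump the mtime, similar to the 04:33 sync
        after = is_cache_stale(source, cache)
        return before, after
```

In this sketch the cache only counts as stale once the source's mtime is newer than the cache's, which is why touching the file was enough to force regeneration.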
[05:18:43] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 7 others: Standardise CXServer deployment - https://phabricator.wikimedia.org/T101272#1707839 (10santhosh) a:3santhosh [05:25:03] 6operations, 7Icinga: improve icinga performance / solve general load issues on neon - https://phabricator.wikimedia.org/T85222#1707844 (10Dzahn) [05:25:05] 6operations, 7Monitoring: Monitor all mgmt hosts - https://phabricator.wikimedia.org/T85143#1707843 (10Dzahn) [05:25:45] !log generating elasticsearch indices in codfw, should run ~3 hours [05:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:30:02] 6operations, 7Icinga: improve icinga performance / solve general load issues on neon - https://phabricator.wikimedia.org/T85222#1707850 (10Dzahn) [05:30:04] 6operations, 7Monitoring: icinga "max concurrent checks" limits reached - https://phabricator.wikimedia.org/T1242#1707849 (10Dzahn) [05:39:08] 6operations, 10Gitblit, 7Monitoring: Improve monitoring of https://git.wikimedia.org/ - https://phabricator.wikimedia.org/T94320#1707860 (10Dzahn) The monitoring isn't the problem, the service is :p This ticket as it currently stands is resolved and has been a long time. I once suggested to [[ https://ger... [05:43:27] 6operations, 10Parsoid: Investigate Oct 3 outage of the Parsoid cluster due to high cpu usage + high memory usage (sharp spike in both) around 08:35 UTC - https://phabricator.wikimedia.org/T114558#1707862 (10GWicke) Pagebundle request rates around the time of the outage: {F2663471} [05:45:12] 6operations, 10Gitblit, 7Monitoring: Improve monitoring of https://git.wikimedia.org/ - https://phabricator.wikimedia.org/T94320#1707863 (10Dzahn) 5stalled>3Resolved a:3Dzahn monitoring and notifications works: example from IRC: 17:35 < icinga-wm> PROBLEM - git.wikimedia.org on antimony is CRITICAL:... 
[05:54:04] (03PS3) 10Dzahn: lvs: double quoted string and other lint [puppet] - 10https://gerrit.wikimedia.org/r/243856 [05:54:27] (03PS3) 10Dzahn: lint: double quoted strings pt.3 [puppet] - 10https://gerrit.wikimedia.org/r/243855 [06:01:01] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Oct 7 06:01:01 UTC 2015 (duration 1m 0s) [06:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:29:25] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: puppet fail [06:30:13] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:52] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:04] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:33] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:23] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:22] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:16] (03PS1) 10Yuvipanda: labstore: Add a delete-dbuser script [puppet] - 10https://gerrit.wikimedia.org/r/244120 [06:34:42] puppet is so scared of _joe_ it fails every time he wakes up [06:34:59] (03PS2) 10Yuvipanda: labstore: Add a delete-dbuser script [puppet] - 10https://gerrit.wikimedia.org/r/244120 [06:36:59] hahaha [06:43:05] (03CR) 10EBernhardson: "the non-failure messages going to the failure channel is being addressed in https://gerrit.wikimedia.org/r/#/c/243956/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243744 (owner: 10EBernhardson) [06:44:07] (03CR) 10Muehlenhoff: [C: 04-1] "This also needs a rule for Apache on port 80." [puppet] - 10https://gerrit.wikimedia.org/r/223887 (https://phabricator.wikimedia.org/T104943) (owner: 10Dzahn) [06:49:53] PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:50:14] PROBLEM - puppet last run on mw1155 is CRITICAL: CRITICAL: Puppet has 1 failures [06:51:32] RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING [06:54:52] (03PS1) 10Muehlenhoff: irc.wikimedia.org: Add ferm rule for Apache [puppet] - 10https://gerrit.wikimedia.org/r/244122 (https://phabricator.wikimedia.org/T104943) [06:56:33] 6operations, 10Parsoid: Investigate Oct 3 outage of the Parsoid cluster due to high cpu usage + high memory usage (sharp spike in both) around 08:35 UTC - https://phabricator.wikimedia.org/T114558#1707933 (10GWicke) Here is a break-down of parsoid requests failing with ETIMEDOUT in the outage period, from the... [06:56:33] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:57:04] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:58:03] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:04] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:31] (03CR) 10Muehlenhoff: [C: 031] mw-rc-irc: firewall hole for RC IRC bot [puppet] - 10https://gerrit.wikimedia.org/r/244068 (https://phabricator.wikimedia.org/T104943) (owner: 10Dzahn) [07:11:01] (03CR) 10Muehlenhoff: [C: 031] logstash: access to port 9200 for krypton [puppet] - 10https://gerrit.wikimedia.org/r/244095 (https://phabricator.wikimedia.org/T114836) (owner: 10Dzahn) [07:18:02] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [07:49:52] (03CR) 10Alexandros Kosiaris: [C: 031] cassandra: new metrics-collector version [puppet] - 10https://gerrit.wikimedia.org/r/243127 (https://phabricator.wikimedia.org/T113733) (owner: 10Filippo Giunchedi) [07:56:33] 
(03CR) 10Alexandros Kosiaris: [C: 032] "Merging" [puppet] - 10https://gerrit.wikimedia.org/r/243400 (owner: 10Cscott) [07:56:38] (03PS2) 10Alexandros Kosiaris: Update cxserver Parsoid configuration. [puppet] - 10https://gerrit.wikimedia.org/r/243400 (owner: 10Cscott) [07:56:41] (03CR) 10Alexandros Kosiaris: [V: 032] Update cxserver Parsoid configuration. [puppet] - 10https://gerrit.wikimedia.org/r/243400 (owner: 10Cscott) [07:57:42] akosiaris: thanks for ^ [07:58:24] kart_: yw. btw, cxserver is not using parsoid anymore, is it ? [07:58:59] it now uses restbase ? or not (yet) ? [08:02:15] (03CR) 10Alexandros Kosiaris: [C: 032] varnish: add 'incubator' to maps-frontend regex [puppet] - 10https://gerrit.wikimedia.org/r/244070 (https://phabricator.wikimedia.org/T113122) (owner: 10Dzahn) [08:02:22] (03PS2) 10Alexandros Kosiaris: varnish: add 'incubator' to maps-frontend regex [puppet] - 10https://gerrit.wikimedia.org/r/244070 (https://phabricator.wikimedia.org/T113122) (owner: 10Dzahn) [08:02:26] (03CR) 10Alexandros Kosiaris: [V: 032] varnish: add 'incubator' to maps-frontend regex [puppet] - 10https://gerrit.wikimedia.org/r/244070 (https://phabricator.wikimedia.org/T113122) (owner: 10Dzahn) [08:04:22] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above 100 [08:04:38] 6operations, 6Discovery, 10Maps, 10Traffic, 5Patch-For-Review: maps: support wikivoyages in incubator - https://phabricator.wikimedia.org/T113122#1707982 (10akosiaris) Change merged, anything else to do here or is this ready to be resolved ? 
[08:05:13] !log installed spice security updates on labvirt* [08:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:09:00] (03CR) 10Alexandros Kosiaris: [C: 04-1] Specify SSHD listen address for lvs hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/243982 (owner: 10Rush) [08:15:43] PROBLEM - Outgoing network saturation on labstore1002 is CRITICAL: CRITICAL: 16.00% of data above the critical threshold [100000000.0] [08:15:52] PROBLEM - puppet last run on mw2160 is CRITICAL: CRITICAL: puppet fail [08:23:12] (03CR) 10Filippo Giunchedi: [C: 031] memcached: move ferm rules out of the module [puppet] - 10https://gerrit.wikimedia.org/r/243652 (owner: 10Muehlenhoff) [08:26:24] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 100 [08:35:37] (03CR) 10Alexandros Kosiaris: [C: 032] contint: remove pylint/pyflakes packages [puppet] - 10https://gerrit.wikimedia.org/r/243915 (https://phabricator.wikimedia.org/T114360) (owner: 10Hashar) [08:35:44] (03PS2) 10Alexandros Kosiaris: contint: remove pylint/pyflakes packages [puppet] - 10https://gerrit.wikimedia.org/r/243915 (https://phabricator.wikimedia.org/T114360) (owner: 10Hashar) [08:38:51] 6operations: mailman check_queue recurrent alarm/recovery - https://phabricator.wikimedia.org/T114861#1708040 (10fgiunchedi) 3NEW [08:39:53] RECOVERY - Outgoing network saturation on labstore1002 is OK: OK: Less than 10.00% above the threshold [75000000.0] [08:41:58] Good morning akosiaris [08:42:02] Are you around ? 
[08:43:13] RECOVERY - puppet last run on mw2160 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [08:44:54] !log disable puppet on restbase, maps, aqs before merging https://gerrit.wikimedia.org/r/#/c/243127 [08:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:45:14] (03PS4) 10Filippo Giunchedi: cassandra: new metrics-collector version [puppet] - 10https://gerrit.wikimedia.org/r/243127 (https://phabricator.wikimedia.org/T113733) [08:45:21] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: new metrics-collector version [puppet] - 10https://gerrit.wikimedia.org/r/243127 (https://phabricator.wikimedia.org/T113733) (owner: 10Filippo Giunchedi) [08:46:31] joal: yup [08:46:34] (03PS3) 10Muehlenhoff: memcached: move ferm rules out of the module [puppet] - 10https://gerrit.wikimedia.org/r/243652 [08:46:38] joal: good morning to you too [08:46:48] akosiaris: :) [08:47:21] Seems it doesn't work as planned akosiaris --> hadoop asks me for port 9160 to be opened to access cassandra :( [08:48:13] PROBLEM - puppet last run on xenon is CRITICAL: CRITICAL: puppet fail [08:48:39] hadoop ? I have probably misunderstood the architecture here. what does hadoop have to do with the entire thing ? isn't there a client that talks to hadoop, gets the data from hadoop and then pushes them to cassandra ? [08:49:01] akosiaris: it's a hadoop job that pushes the data to cassandra [08:49:43] ok then, sounds like some component that hadoop needs to talk to cassandra needs updating [08:50:05] cause if it wants port 9160, then it wants the old thrift (aka RPC) protocol and that's deprecated [08:50:49] joal: mind pasting me a link to the code that does all that? 
I remember something about CQL3 (native protocol) in that code [08:50:50] I remember we talked about that already, and found that normally the dependencies I use use the new CQL protocol [08:51:03] Yup :) [08:51:39] akosiaris: https://gerrit.wikimedia.org/r/#/c/232448/ [08:51:55] akosiaris: more precisely: https://gerrit.wikimedia.org/r/#/c/232448/1/refinery-job/src/main/java/org/wikimedia/analytics/refinery/job/CassandraXSVLoader.java [08:52:36] Cassandra dependency in pom is version 2.1.7 --> I need to upgrade to 2.1.8 [08:53:06] But apart from that :( [08:53:18] http://mvnrepository.com/artifact/com.datastax.cassandra [08:53:32] (03PS1) 10Filippo Giunchedi: cassandra: simplify $blacklist validation [puppet] - 10https://gerrit.wikimedia.org/r/244126 [08:53:32] cassandra-driver-core (57) [08:53:32] DataStax Java Driver For Apache Cassandra Core1533.0.0-alpha3A driver for Apache Cassandra 1.2+ that works exclusively with the Cassandra Query Language version 3 (CQL3) and Cassandra's binary protocol. Cassandra ClientsCassandra Drivers [08:53:37] seems like it is alpha ? [08:54:01] seriously ? grrrr [08:54:19] http://mvnrepository.com/artifact/com.datastax.cassandra/cassandra-driver-core --> version 2.1.8 is released [08:55:05] A driver for Apache Cassandra 1.2+ that works exclusively with the Cassandra Query Language version 3 (CQL3) and Cassandra's binary protocol. [08:55:11] same here for 2.1.8 [08:55:16] ok then [08:55:23] how come it requires thrift then ? [08:55:27] that does not sound right [08:55:30] I have no idea :( [08:56:29] (03CR) 10Alexandros Kosiaris: [C: 031] cassandra: simplify $blacklist validation [puppet] - 10https://gerrit.wikimedia.org/r/244126 (owner: 10Filippo Giunchedi) [08:56:43] hmm [08:57:24] wait [08:57:39] org.apache.cassandra.hadoop.cql3.CqlConfigHelper; [08:57:50] that's not com.datastax.cassandra ? [08:57:59] so, 2 different drivers ? 
[08:58:08] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: simplify $blacklist validation [puppet] - 10https://gerrit.wikimedia.org/r/244126 (owner: 10Filippo Giunchedi) [08:58:09] kind of confused now [08:58:43] There are two things: cassandra-driver - datastax, and cassandra-hadoop integration that uses the driver -- org.apache.cassandra [09:00:14] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:03:29] akosiaris: As stated in here: https://github.com/apache/cassandra/tree/trunk/examples/hadoop_cql3_word_count, the integration I made should use CQL3 :( [09:04:54] akosiaris: but from the build.xml file in that repo, it looks it is using thrift as well https://github.com/apache/cassandra/blob/trunk/examples/hadoop_cql3_word_count/build.xml#L75 [09:05:44] akosiaris: also: https://github.com/apache/cassandra/blob/trunk/examples/hadoop_cql3_word_count/src/WordCount.java#L248 [09:05:47] looks bad :( [09:05:51] Maaaaarf :( [09:06:24] joal: L248 is in an else [09:06:24] https://github.com/apache/cassandra/blob/trunk/examples/hadoop_cql3_word_count/src/WordCount.java#L238 [09:06:56] so, "native" for input mapper type ? whatever that thing is ? 
[09:07:59] so, the build.xml probably is using thrift as well to support both types [09:08:08] akosiaris: What I understand from that is that there is either native or thrift mappers (getting data from cassandra) [09:08:14] but the actual choice is happening in the code [09:08:16] (03PS4) 10Muehlenhoff: memcached: move ferm rules out of the module [puppet] - 10https://gerrit.wikimedia.org/r/243652 [09:08:24] But there is only one reducer writing data to it [09:10:12] (03CR) 10Muehlenhoff: [C: 032 V: 032] memcached: move ferm rules out of the module [puppet] - 10https://gerrit.wikimedia.org/r/243652 (owner: 10Muehlenhoff) [09:12:10] joal: the reducer is this part https://github.com/apache/cassandra/blob/trunk/examples/hadoop_cql3_word_count/src/WordCount.java#L218 , right? [09:12:18] I see it uses native CQL3 [09:12:28] akosiaris: it should, yes [09:12:29] https://github.com/apache/cassandra/blob/trunk/examples/hadoop_cql3_word_count/src/WordCount.java#L231 [09:12:49] akosiaris: currently reading more in details the CqlOutputFormat to try to make sense of it [09:14:07] akosiaris: this is the error I got: https://hue.wikimedia.org/jobbrowser/jobs/job_1441303822549_78768/job_attempt_logs/0 [09:14:11] in stderr tab [09:14:59] joal: I have no account for hue [09:15:03] how do I create one ? [09:15:26] hm, I think ottomata does a manual trick [09:15:30] I'll create a gist [09:15:57] akosiaris: https://gist.github.com/jobar/420cc90352f720d95f8a [09:20:18] joal: Caused by: java.io.IOException: Unable to connect to server aqs1001.eqiad.wmnet:9160 at org.apache.cassandra.hadoop.ConfigHelper.createConnection(ConfigHelper.java:574) [09:20:28] note the org.apache.cassandra.hadoop.ConfigHelper [09:20:50] whereas it should be talking about org.apache.cassandra.hadoop.cql3.CqlConfigHelper; ? 
[09:20:55] at least I think so [09:21:11] could be wrong [09:21:57] wait, it might be a wrapper class [09:23:13] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 26 data above and 9 below the confidence bounds [09:26:12] (03PS1) 10Muehlenhoff: archiva: move ferm rules into the role [puppet] - 10https://gerrit.wikimedia.org/r/244129 [09:26:13] akosiaris: You get a point ! [09:28:22] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 27 data above and 9 below the confidence bounds [09:28:41] akosiaris: I can't use CqlConfigHelper only though :( [09:30:02] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [09:32:02] (03CR) 10Daniel Kinzler: "@QChris Ugh. Is the reliance on $1 documented somewhere?" [puppet] - 10https://gerrit.wikimedia.org/r/242237 (owner: 10Daniel Kinzler) [09:33:43] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 31 data above and 9 below the confidence bounds [09:35:12] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 637 [09:37:12] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 30 data above and 9 below the confidence bounds [09:38:08] akosiaris: From the code, seems that the hadoop stuff is using CQL for inserting data, but thrift to retrieve cassandra ring configuration :( [09:40:12] RECOVERY - check_mysql on db1008 is OK: Uptime: 6454341 Threads: 1 Questions: 46505466 Slow queries: 43904 Opens: 107279 Flush tables: 2 Open tables: 64 Queries per second avg: 7.205 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [09:42:23] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 31 data above and 9 below the confidence bounds [09:46:52] RECOVERY - puppet 
last run on mw1155 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [09:47:43] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 32 data above and 8 below the confidence bounds [09:49:00] joal: git lg cassandra-2.1.7..cassandra-2.2.2 |wc -l [09:49:00] 3013 [09:49:15] just wondering why not use a newer version of that driver [09:49:37] that part seems to have changed quite a lot between the 2 versions btw [09:49:57] :~/cassandra-trunk/src/java/org/apache/cassandra/hadoop/cql3$ git lg cassandra-2.1.7..cassandra-2.2.2 . |wc -l [09:49:57] 2788 [09:50:22] akosiaris: I just wanted to be coherent with the cassandra version we are using [09:50:31] I can try with the new version though [09:50:50] hmm, there's a link between the 2 versions naming ? [09:51:25] ? [09:52:03] hmm [09:52:04] reading a bit of the code from 2.2.2, seems closer to what we want (NativeRingCache instead of RingCache for instance) [09:52:07] (03CR) 10Aklapper: "I might I need more context to answer... Literally the domain name translates to "wikibooks". http://wikiknihy.cz/ redirects to http://cs." [dns] - 10https://gerrit.wikimedia.org/r/244104 (owner: 10Dzahn) [09:52:16] so, that repo has everything in it [09:52:26] server code, client libraries... 
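Editor's note on the thread so far: the Hadoop job fails connecting to aqs1001.eqiad.wmnet:9160, the Thrift (RPC) port, even though writes are expected to go over CQL. One way to confirm which protocol ports a node accepts is a plain TCP connect test, sketched below in Python. Assumptions: 9042 is Cassandra's default native-protocol port (it is not mentioned in the log and a cluster may be configured differently); `port_open` and `probe` are hypothetical helper names; a successful connect only shows the port is reachable, not that the protocol behind it works.

```python
import socket

# Default Cassandra ports; 9160 appears in the log, 9042 is an assumption
# based on Cassandra's stock configuration.
CASSANDRA_PORTS = {"thrift": 9160, "native": 9042}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def probe(host, ports=CASSANDRA_PORTS, timeout=3.0):
    """Map each named port to whether it accepts TCP connections."""
    return {name: port_open(host, port, timeout) for name, port in ports.items()}
```

Calling `probe("aqs1001.eqiad.wmnet")` from a Hadoop worker would return a dict with one boolean per port; a closed Thrift port alongside an open native port would be consistent with the firewall behaviour described above.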
[09:52:51] even python libraries [09:54:07] akosiaris: trying to make a jar and use it [09:55:04] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [10:10:20] <_joe_> !log depooling cp1059 from pybal, varnish [10:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:11:31] (03CR) 10Alexandros Kosiaris: [C: 031] cassandra: enable multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/243675 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [10:11:57] (03PS1) 10ArielGlenn: salt master: change secondary timeout from 5 to 10 seconds [puppet] - 10https://gerrit.wikimedia.org/r/244133 [10:12:33] (03PS3) 10Filippo Giunchedi: cassandra: enable multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/243675 (https://phabricator.wikimedia.org/T95253) [10:12:38] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: enable multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/243675 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [10:13:08] (03PS2) 10ArielGlenn: salt master: change secondary timeout from 5 to 10 seconds [puppet] - 10https://gerrit.wikimedia.org/r/244133 [10:14:02] (03CR) 10ArielGlenn: [C: 032] salt master: change secondary timeout from 5 to 10 seconds [puppet] - 10https://gerrit.wikimedia.org/r/244133 (owner: 10ArielGlenn) [10:16:52] PROBLEM - puppet last run on mw2109 is CRITICAL: CRITICAL: puppet fail [10:18:22] (03PS1) 10ArielGlenn: salt master config fix stupid typo [puppet] - 10https://gerrit.wikimedia.org/r/244134 [10:18:27] (03CR) 10jenkins-bot: [V: 04-1] salt master config fix stupid typo [puppet] - 10https://gerrit.wikimedia.org/r/244134 (owner: 10ArielGlenn) [10:18:46] (03PS2) 10ArielGlenn: salt master config fix stupid typo [puppet] - 10https://gerrit.wikimedia.org/r/244134 [10:19:32] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] 
[10:19:49] (03CR) 10ArielGlenn: [C: 032] salt master config fix stupid typo [puppet] - 10https://gerrit.wikimedia.org/r/244134 (owner: 10ArielGlenn) [10:20:53] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 1 failures [10:26:07] 6operations, 10ops-eqiad: cp1059 has network issues - https://phabricator.wikimedia.org/T114870#1708254 (10Joe) 3NEW [10:28:45] (03PS1) 10Filippo Giunchedi: cassandra: conditionally add rpc_address to rpc_interface [puppet] - 10https://gerrit.wikimedia.org/r/244135 [10:36:14] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [10:38:58] (03PS1) 10Filippo Giunchedi: cassandra: fix config_directory ownership [puppet] - 10https://gerrit.wikimedia.org/r/244136 [10:39:44] (03PS2) 10Filippo Giunchedi: cassandra: fix config_directory ownership [puppet] - 10https://gerrit.wikimedia.org/r/244136 [10:39:55] (03PS3) 10Filippo Giunchedi: cassandra: fix config_directory ownership [puppet] - 10https://gerrit.wikimedia.org/r/244136 [10:40:01] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: fix config_directory ownership [puppet] - 10https://gerrit.wikimedia.org/r/244136 (owner: 10Filippo Giunchedi) [10:41:02] (03CR) 10Muehlenhoff: "The idea was to have the ability to create subsets of systems which can be updated in advance, so e.g. declare all of labvirt* as debdeplo" [puppet] - 10https://gerrit.wikimedia.org/r/243142 (https://phabricator.wikimedia.org/T111006) (owner: 10Muehlenhoff) [10:41:33] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[10:44:06] (03PS1) 10ArielGlenn: salt master: turn off pillar and grain cache for minions [puppet] - 10https://gerrit.wikimedia.org/r/244137 [10:44:39] (03PS2) 10Filippo Giunchedi: cassandra: conditionally add rpc_address to rpc_interface [puppet] - 10https://gerrit.wikimedia.org/r/244135 [10:44:46] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: conditionally add rpc_address to rpc_interface [puppet] - 10https://gerrit.wikimedia.org/r/244135 (owner: 10Filippo Giunchedi) [10:44:56] (03PS2) 10ArielGlenn: salt master: turn off pillar and grain cache for minions [puppet] - 10https://gerrit.wikimedia.org/r/244137 [10:45:12] RECOVERY - puppet last run on mw2109 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [10:46:07] (03CR) 10ArielGlenn: [C: 032] salt master: turn off pillar and grain cache for minions [puppet] - 10https://gerrit.wikimedia.org/r/244137 (owner: 10ArielGlenn) [10:48:50] (03PS1) 10Giuseppe Lavagetto: Convert logging from print to twisted.logger [debs/pybal] - 10https://gerrit.wikimedia.org/r/244138 [10:49:00] <_joe_> paravoid: ^^ you asked for it :) [10:49:05] 6operations, 6Discovery, 10Maps, 10Traffic, 5Patch-For-Review: maps: support wikivoyages in incubator - https://phabricator.wikimedia.org/T113122#1708279 (10Yurik) 5Open>3Resolved a:3Yurik [10:49:20] (03CR) 10jenkins-bot: [V: 04-1] Convert logging from print to twisted.logger [debs/pybal] - 10https://gerrit.wikimedia.org/r/244138 (owner: 10Giuseppe Lavagetto) [10:52:43] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [10:54:29] why? 
[10:55:25] !log reenable puppet on restbase / maps-test / aqs [10:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:04:38] _joe_: perhaps you could add rationale to the commit message [11:04:54] <_joe_> mark: heh, yes, it's a WIP in fact [11:05:14] <_joe_> I should've done git review -D [11:05:23] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [11:05:24] <_joe_> or add [WiP] there [11:14:35] <_joe_> mark: first of all, I forgot to check if the new twisted.logger was available in jessie, and turns out it's not. Even if jessie is on twisted 14 [11:17:07] (03PS1) 10TTO: Disable title blacklist on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) [11:19:21] (03PS1) 10TTO: Revert "Route Bug40009 logs to fluorine" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244141 [11:19:24] (03CR) 10jenkins-bot: [V: 04-1] Revert "Route Bug40009 logs to fluorine" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244141 (owner: 10TTO) [11:20:25] (03CR) 10Hoo man: [C: 031] Disable title blacklist on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) (owner: 10TTO) [11:22:26] (03CR) 10TTO: [C: 04-1] "Actually they might want to keep TB enabled locally, just remove the meta source" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) (owner: 10TTO) [11:24:52] (03PS2) 10TTO: Exempt private/fishbowl wikis from the global title blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) [11:25:37] (03PS1) 10KartikMistry: Enable CX suggestions in ast, bn, ml, nb, ta and ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244142 (https://phabricator.wikimedia.org/T112848) [11:37:52] (03CR) 10Alex Monk: Exempt private/fishbowl wikis from the 
global title blacklist (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) (owner: 10TTO) [11:44:44] (03PS3) 10TTO: Exempt private/fishbowl wikis from the global title blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) [11:52:36] 6operations: Photos of Servers - https://phabricator.wikimedia.org/T94694#1708461 (10Steinsplitter) WOW @VictorGrigas Cool photos. Thanks a lot! [12:11:44] (03CR) 10Daniel Kinzler: [C: 031] "We want this, and the implementation *looks* correct" [puppet] - 10https://gerrit.wikimedia.org/r/238396 (https://phabricator.wikimedia.org/T111015) (owner: 10Bene) [12:12:34] how to realize that apache cassandra ppl are not using ant [12:12:40] try to build the project... [12:12:54] akosiaris: I managed to get a job running [12:13:00] error: unmappable character for encoding ASCII [12:13:10] joal: awesome!!! [12:13:13] using version 2.2.2 of org.apache.cassandra [12:13:23] yes!!! that's the news I wanted to hear [12:13:26] and 2.2.0-rc3 of driver (how wrong) [12:13:35] rc3 ? [12:13:38] lol [12:13:40] how come ? 
[12:13:42] akosiaris: I'm still fighting with hue integration, but getting closer every minute :) [12:14:04] akosiaris: no f.cking idea, and I'm not sure I want to know more :-D [12:14:13] lol [12:14:23] yeah, understandable [12:14:38] akosiaris: not hue integration sorry, oozie [12:14:43] but yeah, getting close :) [12:15:35] well, 2788 commits should do it indeed [12:19:42] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [12:20:14] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [12:21:03] PROBLEM - puppet last run on analytics1032 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [12:25:23] PROBLEM - puppet last run on elastic1012 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [12:27:03] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [12:27:11] !log salt-rm'ing /var/lib/apt/lists/ubuntu.wikimedia.org_ubuntu_dists_trusty_main_i18n_Translation-en%5fUS [12:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:29:53] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [12:30:03] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [12:30:04] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [12:30:33] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [12:30:42] PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [12:30:52] PROBLEM - puppet last run on mw1102 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [12:31:04] PROBLEM - puppet last run on mw1151 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [12:31:46] ? 
[12:33:42] PROBLEM - puppet last run on wtp2006 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [12:33:43] PROBLEM - puppet last run on mw1105 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [12:34:13] PROBLEM - puppet last run on db2063 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [12:35:02] PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [12:35:24] RECOVERY - puppet last run on mw1105 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:36:05] something weird happened with the ubuntu mirror [12:36:12] well, not our mirror, probably upstream [12:36:31] first it started failing, that recovered, but those files got corrupted (empty gzips) across a few machines [12:36:58] (and the failing apt-get update means puppet doesn't run because puppet-run is set -e, hence those alerts) [12:37:01] bblack: ^ [12:44:49] oh nice [12:51:12] 6operations, 10Continuous-Integration-Infrastructure: Phase out operations-puppet-pep8 Jenkins job and tools/puppet_pep8.py - https://phabricator.wikimedia.org/T114887#1708567 (10hashar) 3NEW a:3hashar [12:54:53] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [12:55:02] RECOVERY - puppet last run on elastic1012 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [12:55:13] (03PS2) 10Filippo Giunchedi: install_server: cassandra to /srv for 2 ssd hosts [puppet] - 10https://gerrit.wikimedia.org/r/242098 (https://phabricator.wikimedia.org/T113714) [12:55:46] (03PS3) 10Filippo Giunchedi: install_server: cassandra to /srv for 2 ssd hosts [puppet] - 10https://gerrit.wikimedia.org/r/242098 (https://phabricator.wikimedia.org/T113714) [12:55:54] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [12:55:56] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] install_server: cassandra to /srv for 2 ssd hosts [puppet] 
- 10https://gerrit.wikimedia.org/r/242098 (https://phabricator.wikimedia.org/T113714) (owner: 10Filippo Giunchedi) [12:56:33] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [12:57:30] (03PS2) 10Filippo Giunchedi: cassandra: add restbase-test2001 instances [puppet] - 10https://gerrit.wikimedia.org/r/243944 (https://phabricator.wikimedia.org/T95253) [12:57:43] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [12:57:53] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:00:12] !log decomission restbase-test2001 and reimage [13:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:00:23] RECOVERY - puppet last run on mw1102 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:00:42] RECOVERY - puppet last run on mw1151 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:05:22] PROBLEM - Cassandra CQL query interface on restbase-test2001 is CRITICAL: Connection refused [13:06:40] that's me ^ silencing [13:06:43] (03PS1) 10Hashar: swift: fix some alignement in rewrite.py [puppet] - 10https://gerrit.wikimedia.org/r/244147 (https://phabricator.wikimedia.org/T114887) [13:07:52] PROBLEM - puppet last run on mw1209 is CRITICAL: CRITICAL: Puppet has 1 failures [13:08:43] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exports is active [13:13:14] (03PS1) 10Hashar: tox entry point to run pep8==1.4.6 [puppet] - 10https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) [13:13:28] (03CR) 10Hashar: [C: 04-1] "WIP" [puppet] - 10https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) (owner: 10Hashar) [13:13:58] (03CR) 10jenkins-bot: [V: 04-1] tox entry point to run pep8==1.4.6 
[puppet] - 10https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) (owner: 10Hashar) [13:14:52] RECOVERY - puppet last run on mw1209 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [13:19:23] RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [13:20:34] RECOVERY - puppet last run on snapshot1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:22:03] RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [13:27:33] PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:29:41] (03PS3) 10Filippo Giunchedi: cassandra: add restbase-test2001 instances [puppet] - 10https://gerrit.wikimedia.org/r/243944 (https://phabricator.wikimedia.org/T95253) [13:29:54] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase-test2001 instances [puppet] - 10https://gerrit.wikimedia.org/r/243944 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [13:31:12] RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING [13:33:03] RECOVERY - puppet last run on wtp2006 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [13:33:13] PROBLEM - puppet last run on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:34:33] PROBLEM - YARN NodeManager Node-State on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:35:02] RECOVERY - puppet last run on analytics1017 is OK: OK: Puppet is currently enabled, last run 10 minutes ago with 0 failures [13:37:10] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure: Clarify the salt version to use on beta cluster - https://phabricator.wikimedia.org/T114755#1708677 (10hashar) I have no idea why http://debian.saltstack.com/debian/ has been added. Seem it was a manual change related to testing the 2015.5... [13:37:52] RECOVERY - YARN NodeManager Node-State on analytics1017 is OK: OK: YARN NodeManager analytics1017.eqiad.wmnet:8041 Node-State: RUNNING [13:41:10] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure: Clarify the salt version to use on beta cluster - https://phabricator.wikimedia.org/T114755#1708682 (10ArielGlenn) we run 2014.7.5 and we run it on precise through jessie both in production and in labs. 2015.5 runs nowhere and I have not e... [13:41:31] (03CR) 10Rush: Specify SSHD listen address for lvs hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/243982 (owner: 10Rush) [13:42:45] (03PS2) 10Rush: Specify SSHD listen address for lvs hosts [puppet] - 10https://gerrit.wikimedia.org/r/243982 [13:43:02] (03PS3) 10Rush: Specify SSHD listen address for lvs hosts [puppet] - 10https://gerrit.wikimedia.org/r/243982 [13:43:04] apergos: I am wondering what happened with salt / debian repo :-D will clean it up [13:43:17] yeah I'm wondering that too. [13:43:31] unless yuvi has some idea I would toss that crap right now [13:43:54] going to remove it [13:43:57] sweet [13:43:58] (03PS4) 10Rush: Specify SSHD listen address for lvs hosts [puppet] - 10https://gerrit.wikimedia.org/r/243982 [13:45:18] 6operations: Photos of Servers - https://phabricator.wikimedia.org/T94694#1708700 (10VictorGrigas) Welcome! 
Happy to share [13:47:08] (03PS1) 10Filippo Giunchedi: cassandra: finish merging multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/244154 [13:47:49] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: finish merging multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/244154 (owner: 10Filippo Giunchedi) [13:48:02] apergos: even worse on precise we have http://ppa.launchpad.net/saltstack/salt/ubuntu [13:48:18] in the lists? seriously? [13:48:32] well all the packages and dependencies are right in our wikimedia repo so enough of that [13:49:46] I dunno why beta has those when prod doesn't [13:49:57] all of that sounds like beta hacks some beta root did [13:50:07] we never had external PPAs or saltstack repos in prod [13:50:10] and we'd never do that [13:50:22] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [13:50:44] RECOVERY - puppet last run on db2063 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [13:50:53] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [13:51:23] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:54:15] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure: Clarify the salt version to use on beta cluster - https://phabricator.wikimedia.org/T114755#1708707 (10hashar) 5Open>3Resolved root@deployment-salt:~# salt '*' cmd.run 'grep -R salt /etc/apt/sources.list.d/' ``` lang=yaml deployment-cac... [13:58:23] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure: Clarify the salt version to use on beta cluster - https://phabricator.wikimedia.org/T114755#1708710 (10hashar) ``` # salt 'deployment-cache-*' pkg.version salt-common deployment-cache-mobile04.deployment-prep.eqiad.wmflabs: 2014.7.5+ds-1... 
[14:00:13] (03CR) 10Alexandros Kosiaris: [C: 031] Specify SSHD listen address for lvs hosts [puppet] - 10https://gerrit.wikimedia.org/r/243982 (owner: 10Rush) [14:03:19] s/SSHD/sshd/ [14:03:50] (03CR) 10Ottomata: [C: 031] "+1 in general, but I'd just call these what they are 'rsync' 'http' 'https', since the ferm rules themselves don't have anything to do wit" [puppet] - 10https://gerrit.wikimedia.org/r/244129 (owner: 10Muehlenhoff) [14:04:08] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Expand shell access for aklapper on Phabricator - https://phabricator.wikimedia.org/T113124#1708733 (10chasemp) >>! In T113124#1707825, @Dzahn wrote: > @chasemp what was the potential problem that @yuvipanda mentioned. @aklapper do you have time now? how... [14:10:37] (03CR) 10Rush: [C: 04-1] "https://phabricator.wikimedia.org/T113124#1708733" [puppet] - 10https://gerrit.wikimedia.org/r/244116 (owner: 10Dzahn) [14:14:43] RECOVERY - Restbase root url on aqs1003 is OK: HTTP OK: HTTP/1.1 200 - 690 bytes in 0.012 second response time [14:15:24] RECOVERY - Restbase root url on aqs1002 is OK: HTTP OK: HTTP/1.1 200 - 690 bytes in 0.012 second response time [14:17:58] 6operations, 6Analytics-Kanban, 10netops, 5Patch-For-Review: Puppetize a server with a role that sets up Cassandra on Analytics machines [13 pts] {slug} - https://phabricator.wikimedia.org/T107056#1708785 (10mobrovac) [14:20:06] (03PS1) 10Filippo Giunchedi: cassandra: bring in instance variables from top scope [puppet] - 10https://gerrit.wikimedia.org/r/244155 [14:21:27] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: bring in instance variables from top scope [puppet] - 10https://gerrit.wikimedia.org/r/244155 (owner: 10Filippo Giunchedi) [14:22:04] 6operations, 10Parsoid: Investigate Oct 3 outage of the Parsoid cluster due to high cpu usage + high memory usage (sharp spike in both) around 08:35 UTC - https://phabricator.wikimedia.org/T114558#1708791 (10ssastry) >>! 
In T114558#1707933, @GWicke wrote: > Here is a break-down of parsoid requests failing with... [14:22:05] (03CR) 10PleaseStand: "> What regex engine is used?" [puppet] - 10https://gerrit.wikimedia.org/r/242237 (owner: 10Daniel Kinzler) [14:23:10] (03CR) 10Ottomata: [C: 032] hadoop/analytics: lint fixes - indentation [puppet/cdh] - 10https://gerrit.wikimedia.org/r/242031 (owner: 10Dzahn) [14:24:28] 6operations: mailman check_queue recurrent alarm/recovery - https://phabricator.wikimedia.org/T114861#1708813 (10Dzahn) a:3Dzahn [14:24:43] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Expand shell access for aklapper on Phabricator - https://phabricator.wikimedia.org/T113124#1708814 (10Dzahn) a:5yuvipanda>3Dzahn [14:26:19] 6operations: mailman check_queue recurrent alarm/recovery - https://phabricator.wikimedia.org/T114861#1708831 (10JohnLewis) Needs per queue levels. The issue is all bounces emails and all digest emails go out at once which easily is 500+ emails at any one time. [14:27:50] (03PS1) 10Filippo Giunchedi: cassandra: reference per-instance tls material [puppet] - 10https://gerrit.wikimedia.org/r/244158 [14:27:52] (03PS10) 10coren: webservicemonitor: some improvements [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/239377 (https://phabricator.wikimedia.org/T109362) [14:28:13] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: reference per-instance tls material [puppet] - 10https://gerrit.wikimedia.org/r/244158 (owner: 10Filippo Giunchedi) [14:31:23] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [14:33:07] (03PS1) 10Mobrovac: RESTBase: Add the header_match security definition [puppet] - 10https://gerrit.wikimedia.org/r/244159 [14:36:43] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [14:37:22] (03PS2) 10Muehlenhoff: Enable ferm on 
analytics1002 [puppet] - 10https://gerrit.wikimedia.org/r/243150 [14:39:24] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] RESTBase: Add the header_match security definition [puppet] - 10https://gerrit.wikimedia.org/r/244159 (owner: 10Mobrovac) [14:39:42] (03PS3) 10Muehlenhoff: Enable ferm on analytics1002 [puppet] - 10https://gerrit.wikimedia.org/r/243150 [14:40:29] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on analytics1002 [puppet] - 10https://gerrit.wikimedia.org/r/243150 (owner: 10Muehlenhoff) [14:40:53] 6operations, 6Phabricator, 7Database: phabricator dump script should use slave db, not master - https://phabricator.wikimedia.org/T112193#1708878 (10chasemp) 5Open>3Resolved a:3chasemp I did this :) [14:41:53] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [14:45:49] (03CR) 10Alex Monk: "I think you misunderstood. We should just be able to set wmgUseGlobalTitleBlacklist to false on labswiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) (owner: 10TTO) [14:49:21] (03PS1) 10Mobrovac: AQS: Load the pageviews module on start-up [puppet] - 10https://gerrit.wikimedia.org/r/244164 [14:50:34] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [14:51:15] (03PS1) 10Thiemo Mättig (WMDE): Add pageImagesPropertyIds configuration for Wikibase servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244165 (https://phabricator.wikimedia.org/T112865) [14:51:44] (03CR) 10Milimetric: [C: 031] "with apologies" [puppet] - 10https://gerrit.wikimedia.org/r/244164 (owner: 10Mobrovac) [14:54:45] 6operations, 10Parsoid: Investigate Oct 3 outage of the Parsoid cluster due to high cpu usage + high memory usage (sharp spike in both) around 08:35 UTC - https://phabricator.wikimedia.org/T114558#1708904 (10GWicke) Requests 
timing out at 8:37 would have been started for the first time at 8:33, considering RES... [14:55:00] (03CR) 10Ottomata: [C: 032] AQS: Load the pageviews module on start-up [puppet] - 10https://gerrit.wikimedia.org/r/244164 (owner: 10Mobrovac) [14:56:16] 6operations, 10Parsoid: Investigate Oct 3 outage of the Parsoid cluster due to high cpu usage + high memory usage (sharp spike in both) around 08:35 UTC - https://phabricator.wikimedia.org/T114558#1708907 (10ssastry) >>! In T114558#1708904, @GWicke wrote: > Requests timing out at 8:37 would have been started f... [14:59:48] !log AQS restarting restbase on aqs100x [14:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151007T1500). Please do the needful. [15:00:04] James_F Krenair RoanKattouw: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:12] * RoanKattouw waves [15:00:27] 6operations, 6Discovery, 10Maps, 10Traffic: maps: support wikivoyages in incubator - https://phabricator.wikimedia.org/T113122#1708926 (10Revi) [15:00:34] hey [15:00:57] Krenair: Do you want to do the honours? 
[15:01:20] sure [15:01:24] (03PS3) 10Alex Monk: VisualEditor: Switch to opt-out for English Wikipedia logged-in users only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242041 (https://phabricator.wikimedia.org/T112348) (owner: 10Jforrester) [15:01:32] (03CR) 10Alex Monk: [C: 032] VisualEditor: Switch to opt-out for English Wikipedia logged-in users only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242041 (https://phabricator.wikimedia.org/T112348) (owner: 10Jforrester) [15:01:38] (03Merged) 10jenkins-bot: VisualEditor: Switch to opt-out for English Wikipedia logged-in users only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242041 (https://phabricator.wikimedia.org/T112348) (owner: 10Jforrester) [15:01:50] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [15:02:03] 6operations, 10Parsoid: Investigate Oct 3 outage of the Parsoid cluster due to high cpu usage + high memory usage (sharp spike in both) around 08:35 UTC - https://phabricator.wikimedia.org/T114558#1708939 (10cscott) I got to #1714 of the #3274 requests between 8:20 and 8:40. Interpolating, that's probably 8:3... [15:02:20] (03PS1) 10Muehlenhoff: Don't open up the JMX port for debugging [puppet] - 10https://gerrit.wikimedia.org/r/244168 [15:02:38] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/242041/ (duration: 00m 18s) [15:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:03:13] !log krenair@tin Synchronized visualeditor-default.dblist: https://gerrit.wikimedia.org/r/#/c/242041/ (duration: 00m 17s) [15:03:14] (03CR) 10JanZerebecki: [C: 04-1] "Looks good. 
Needs to wait for I5fba0e4 to be merged, but it would be nice if this is deployed before that is deployed, so that can be test" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244165 (https://phabricator.wikimedia.org/T112865) (owner: 10Thiemo Mättig (WMDE)) [15:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:04:43] Krenair: Everything LGTM. [15:05:29] I was going to say the same, but somehow the edit tab has disappeared for my user [15:05:52] For me too [15:06:10] No wait that's just on the main page [15:06:18] On a random page I do have two edit tabs [15:06:36] 6operations, 10Parsoid: Investigate Oct 3 outage of the Parsoid cluster due to high cpu usage + high memory usage (sharp spike in both) around 08:35 UTC - https://phabricator.wikimedia.org/T114558#1708953 (10cscott) Pasting here for the record: (11:18:26 AM) cscott-free: hm, graphite shows two huge OCG reques... [15:06:46] mw.user.options.get( 'visualeditor-enable' ) [15:06:46] 0 [15:07:02] according to the DB I have visualeditor-enable=1 [15:07:04] Krenair: What privs, and on what page? [15:07:18] https://en.wikipedia.org/wiki/User:Krenair/sandbox [15:07:27] Krenair: Reload? [15:07:41] User options are briefly cached, IIRC. [15:07:45] oh, wtf [15:07:46] okay [15:07:51] there we go [15:07:57] that's annoying [15:08:09] PROBLEM - Restbase endpoints health on restbase-test2001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:08:19] Krenair: Yeah, welcome to MW. 
[15:08:29] PROBLEM - Restbase root url on restbase-test2001 is CRITICAL: Connection refused [15:09:29] (03PS1) 10Jcrespo: Pooling again db1051 with db1055's roles, to fix SPOF [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244169 [15:10:10] (03PS1) 10KartikMistry: Added initial Debian package for apertium-es-it [debs/contenttranslation/apertium-es-it] - 10https://gerrit.wikimedia.org/r/244170 (https://phabricator.wikimedia.org/T111902) [15:10:15] (03PS2) 10Jcrespo: Pooling again db1051 with db1055's roles, to fix SPOF [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244169 [15:10:51] (03PS1) 10Hashar: nodepool: monitor nodepoold is present [puppet] - 10https://gerrit.wikimedia.org/r/244171 (https://phabricator.wikimedia.org/T113806) [15:11:32] (03CR) 10Hashar: "Copied from the Zuul monitoring rule we have." [puppet] - 10https://gerrit.wikimedia.org/r/244171 (https://phabricator.wikimedia.org/T113806) (owner: 10Hashar) [15:11:57] (03CR) 10Jcrespo: [C: 032] Pooling again db1051 with db1055's roles, to fix SPOF [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244169 (owner: 10Jcrespo) [15:13:27] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1051 after maintenance (duration: 00m 17s) [15:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:16:00] please report if you see any issue with recent changes, contributions or the watchlist, I've just added a new server for those services [15:16:04] on enwiki [15:16:59] (03CR) 10Gilles: [C: 031] Set page purge limiting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243363 (owner: 10Aaron Schulz) [15:18:16] …did someone break labs puppet earlier and then fix it? I’m wondering if these emails are due to real puppet breakage or an NFS outage or something [15:21:47] andrewbogott: No NFS outage that I can see, and I didn't break/unbreak puppet, but I'm guessing the latter. 
[15:23:01] (03PS1) 10Filippo Giunchedi: restbase: ensure service present [puppet] - 10https://gerrit.wikimedia.org/r/244174 (https://phabricator.wikimedia.org/T103134) [15:25:44] (03PS1) 10Dereckson: Document translation namespace best practices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244176 [15:25:46] (03PS1) 10Dereckson: Namespace configuration on bn.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244177 (https://phabricator.wikimedia.org/T114623) [15:29:04] 6operations, 6Labs, 10Labs-Infrastructure, 3labs-sprint-117: add logrotate for designate logs (holmium disk space) - https://phabricator.wikimedia.org/T114544#1709038 (10Andrew) 5Open>3Resolved Looks like it's working. [15:31:25] JohnFLewis: How can I set up the (recently closed) ee@ list to auto-reject incoming messages? Posts sent to it just end up in the moderation queue now [15:31:32] how could i detect from php which data center a request is running in (in mediawiki-config) ? [15:31:56] RoanKattouw: they do? they should automatically be discarded [15:32:06] (03CR) 10Ottomata: [C: 031] Don't open up the JMX port for debugging [puppet] - 10https://gerrit.wikimedia.org/r/244168 (owner: 10Muehlenhoff) [15:32:45] ebernhardson: RoanKattouw: is the swat window still open? [15:32:56] i've got a late entry that's breaking editing on mobile web for a sample of users... 
[15:32:58] https://gerrit.wikimedia.org/r/#/c/244178/ [15:33:10] jdlrobson: I have a late entry too that's breaking Echo :) [15:33:14] So I'll bring yours along for the ride [15:33:17] a sampled EventLogging schema is causing problems :/ [15:33:21] awesome thanks RoanKattouw [15:33:29] * RoanKattouw stares down Jenkins [15:35:54] (03PS1) 10Jforrester: Configure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244181 [15:38:27] (03PS2) 10Jforrester: Configure $wgRemoteUploadTarget [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244181 [15:38:35] (03CR) 10Andrew Bogott: [C: 032] nodepool: monitor nodepoold is present [puppet] - 10https://gerrit.wikimedia.org/r/244171 (https://phabricator.wikimedia.org/T113806) (owner: 10Hashar) [15:40:22] RoanKattouw: config says they're dropped yet mail is still coming in. heh. let me look into it :) [15:40:29] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [15:41:33] (03PS3) 10Jforrester: Configure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244181 [15:43:37] (03CR) 10GWicke: [C: 031] restbase: ensure service present [puppet] - 10https://gerrit.wikimedia.org/r/244174 (https://phabricator.wikimedia.org/T103134) (owner: 10Filippo Giunchedi) [15:44:03] 6operations, 6Release-Engineering-Team, 7Database: Recover missing values from user_properties tables - https://phabricator.wikimedia.org/T114899#1709107 (10jcrespo) 3NEW [15:44:04] (03PS4) 10Jforrester: Configure $wgRemoteUploadTarget [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244181 [15:44:15] !log catrope@tin Synchronized php-1.27.0-wmf.1/extensions/Echo: SWAT (duration: 00m 17s) [15:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:44:34] !log catrope@tin Synchronized php-1.27.0-wmf.2/extensions/Echo: SWAT (duration: 00m 18s) [15:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:44:52] !log catrope@tin Synchronized 
php-1.27.0-wmf.1/extensions/MobileFrontend: SWAT (duration: 00m 17s) [15:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:45:39] jdlrobson: Done ---^^ [15:45:54] RoanKattouw_away: confirmed! [15:45:55] yay! [15:45:58] it works again! [15:46:00] thank you :) [15:46:47] 6operations, 6Release-Engineering-Team, 7Database: Recover missing values from user_properties tables - https://phabricator.wikimedia.org/T114899#1709116 (10jcrespo) Due to the time that has passed since the values were lost, we are only going to reimport the missing values for now, to avoid annoying users a... [15:47:13] (03PS2) 10Muehlenhoff: Don't open up the JMX port for debugging [puppet] - 10https://gerrit.wikimedia.org/r/244168 [15:49:38] (03PS1) 10MarkTraceur: Add $wgForeignUploadTargets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244182 [15:52:30] Is SWAT over? Can I do something quick so I can demo something useful in my QR? :D [15:52:55] (03CR) 10Muehlenhoff: [C: 032 V: 032] Don't open up the JMX port for debugging [puppet] - 10https://gerrit.wikimedia.org/r/244168 (owner: 10Muehlenhoff) [15:53:18] The config won't even do anything in production yet, so merging it is enough, but don't want to leave an undeployed patch for y'all [15:54:11] 6operations, 6Release-Engineering-Team, 7Database: Recover missing values from user_properties tables - https://phabricator.wikimedia.org/T114899#1709137 (10jcrespo) [15:55:38] marktraceur: Go for it [15:55:43] RoanKattouw_away: Thanks. 
[15:55:53] (03PS2) 10Giuseppe Lavagetto: Convert logging from print to twisted.python.log [WiP] [debs/pybal] - 10https://gerrit.wikimedia.org/r/244138 [15:55:56] Just waiting to get a patch merged quick so I know which config patch to go with [15:56:44] (03PS1) 10KartikMistry: Added initial Debian package for apertium-es-ro [debs/contenttranslation/apertium-es-ro] - 10https://gerrit.wikimedia.org/r/244183 (https://phabricator.wikimedia.org/T111902) [15:57:05] OK, going [15:57:23] Oh, wait, need laptop with SSH keys. *sigh* [15:58:54] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1709164 (10mark) If we have hardware that is out of warranty and (therefore) won't be used for new production stuff, then it could be considered "free"... [15:58:55] (03CR) 10BryanDavis: [C: 031] "Retested with fresh cherry-pick in beta cluster. Data in ELK cluster looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/241984 (https://phabricator.wikimedia.org/T113627) (owner: 10Mforns) [15:59:07] ottomata: ^ LGTM [16:00:21] (03CR) 10MarkTraceur: [C: 032] Add $wgForeignUploadTargets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244182 (owner: 10MarkTraceur) [16:00:27] (03Merged) 10jenkins-bot: Add $wgForeignUploadTargets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244182 (owner: 10MarkTraceur) [16:02:21] (03PS1) 10Filippo Giunchedi: cassandra: exclude instance name from seed list [puppet] - 10https://gerrit.wikimedia.org/r/244185 [16:02:25] Syncing [16:02:26] (03PS15) 10Ottomata: Consume EventLogging validation logs from Logstash [puppet] - 10https://gerrit.wikimedia.org/r/241984 (https://phabricator.wikimedia.org/T113627) (owner: 10Mforns) [16:02:43] (03CR) 10Ottomata: [C: 032 V: 032] Consume EventLogging validation logs from Logstash [puppet] - 10https://gerrit.wikimedia.org/r/241984 (https://phabricator.wikimedia.org/T113627) (owner: 10Mforns) [16:02:48] 
!log marktraceur@tin Synchronized wmf-config/: Adding new config variable for uploads to Commons (duration: 00m 17s) [16:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:03:40] Looks like nothing crashed, waiting to see outcome on beta. [16:03:48] I think pooling of db1051 worked, which is great news [16:03:53] (03PS2) 10Filippo Giunchedi: restbase: ensure service present [puppet] - 10https://gerrit.wikimedia.org/r/244174 (https://phabricator.wikimedia.org/T103134) [16:05:39] PROBLEM - puppet last run on labnodepool1001 is CRITICAL: CRITICAL: puppet fail [16:06:30] (03CR) 10Mobrovac: [C: 031] cassandra: exclude instance name from seed list [puppet] - 10https://gerrit.wikimedia.org/r/244185 (owner: 10Filippo Giunchedi) [16:08:30] (03PS1) 10Filippo Giunchedi: cassandra: add restbase-test2001-b [puppet] - 10https://gerrit.wikimedia.org/r/244186 [16:09:06] (03PS1) 10Muehlenhoff: Don't open up the Kafka JMX port for debugging [puppet] - 10https://gerrit.wikimedia.org/r/244187 [16:12:53] (03CR) 10Eevans: [C: 031] cassandra: exclude instance name from seed list [puppet] - 10https://gerrit.wikimedia.org/r/244185 (owner: 10Filippo Giunchedi) [16:14:26] (03CR) 10GWicke: [C: 031] "wheeee!" 
[puppet] - 10https://gerrit.wikimedia.org/r/244186 (owner: 10Filippo Giunchedi) [16:14:49] !log citoid deploying ec149fd5 [16:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:15:40] (03PS2) 10Filippo Giunchedi: cassandra: exclude instance name from seed list [puppet] - 10https://gerrit.wikimedia.org/r/244185 [16:15:46] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: exclude instance name from seed list [puppet] - 10https://gerrit.wikimedia.org/r/244185 (owner: 10Filippo Giunchedi) [16:19:55] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1709234 (10Andrew) [16:22:59] (03PS2) 10Filippo Giunchedi: cassandra: add restbase-test2001-b [puppet] - 10https://gerrit.wikimedia.org/r/244186 [16:23:07] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase-test2001-b [puppet] - 10https://gerrit.wikimedia.org/r/244186 (owner: 10Filippo Giunchedi) [16:24:43] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1709246 (10RobH) https://rt.wikimedia.org/Ticket/Display.html?id=9677 is the rt ticket to track the quoting of a 1u misc system for pricing consideratio... 
[16:27:06] (03PS3) 10Filippo Giunchedi: restbase: ensure service present [puppet] - 10https://gerrit.wikimedia.org/r/244174 (https://phabricator.wikimedia.org/T103134) [16:27:16] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] restbase: ensure service present [puppet] - 10https://gerrit.wikimedia.org/r/244174 (https://phabricator.wikimedia.org/T103134) (owner: 10Filippo Giunchedi) [16:28:40] PROBLEM - Cassandra database on restbase-test2001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [16:29:46] (03CR) 10BryanDavis: "Inline comment about generated config difference from beta cluster to prod" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/241984 (https://phabricator.wikimedia.org/T113627) (owner: 10Mforns) [16:32:20] PROBLEM - puppet last run on praseodymium is CRITICAL: CRITICAL: puppet fail [16:33:20] PROBLEM - puppet last run on restbase1002 is CRITICAL: CRITICAL: puppet fail [16:33:40] PROBLEM - Hadoop DataNode on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:33:50] PROBLEM - Check size of conntrack table on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:34:20] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: puppet fail [16:34:50] PROBLEM - DPKG on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:35:09] PROBLEM - puppet last run on restbase2003 is CRITICAL: CRITICAL: puppet fail [16:35:10] PROBLEM - Disk space on Hadoop worker on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:35:20] PROBLEM - salt-minion processes on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:35:20] PROBLEM - puppet last run on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:35:41] PROBLEM - Disk space on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:35:41] PROBLEM - Hadoop JournalNode on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:35:49] PROBLEM - SSH on analytics1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:35:59] PROBLEM - puppet last run on restbase1008 is CRITICAL: CRITICAL: puppet fail [16:36:00] PROBLEM - puppet last run on cerium is CRITICAL: CRITICAL: puppet fail [16:36:00] PROBLEM - puppet last run on restbase2004 is CRITICAL: CRITICAL: puppet fail [16:36:01] PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:36:01] PROBLEM - Hadoop NodeManager on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:36:20] PROBLEM - RAID on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:36:20] PROBLEM - configured eth on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:36:21] (03PS1) 10Mforns: Stringify tags passed in Logstash's kafka.erb [puppet] - 10https://gerrit.wikimedia.org/r/244191 (https://phabricator.wikimedia.org/T113627) [16:36:39] PROBLEM - dhclient process on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:36:51] <_joe_> ottomata: is that you? ^^ [16:37:40] PROBLEM - puppet last run on xenon is CRITICAL: CRITICAL: puppet fail [16:38:14] analytics1035 paged and looks broken..on mgmt [16:38:28] i'm going to powercycle it [16:38:32] no output [16:38:48] not sure if the other "xenon" etc have to do with it [16:38:55] <_joe_> mutante: nothing [16:39:11] <_joe_> mutante: if you're powercycling it, I'll keep away :) [16:39:18] (03CR) 10Ottomata: [C: 032] Stringify tags passed in Logstash's kafka.erb [puppet] - 10https://gerrit.wikimedia.org/r/244191 (https://phabricator.wikimedia.org/T113627) (owner: 10Mforns) [16:39:19] PROBLEM - puppet last run on restbase-test2003 is CRITICAL: CRITICAL: puppet fail [16:39:29] eh, and ottomata, you are here though? 
[16:39:36] mutante: no not related, I'm fixing those [16:39:42] !log powercycling analytics1035 [16:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:40:15] (03PS1) 10Filippo Giunchedi: Revert "restbase: ensure service present" [puppet] - 10https://gerrit.wikimedia.org/r/244193 [16:40:29] and i see it booting ... [16:40:30] (03PS2) 10Filippo Giunchedi: Revert "restbase: ensure service present" [puppet] - 10https://gerrit.wikimedia.org/r/244193 [16:40:37] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "restbase: ensure service present" [puppet] - 10https://gerrit.wikimedia.org/r/244193 (owner: 10Filippo Giunchedi) [16:41:00] PROBLEM - puppet last run on restbase2001 is CRITICAL: CRITICAL: puppet fail [16:41:20] PROBLEM - puppet last run on restbase1003 is CRITICAL: CRITICAL: puppet fail [16:42:00] PROBLEM - puppet last run on restbase1001 is CRITICAL: CRITICAL: puppet fail [16:42:30] RECOVERY - Check size of conntrack table on analytics1035 is OK: OK: nf_conntrack is 0 % full [16:42:30] PROBLEM - puppet last run on restbase1004 is CRITICAL: CRITICAL: puppet fail [16:42:31] RECOVERY - Disk space on analytics1035 is OK: DISK OK [16:42:39] RECOVERY - Hadoop JournalNode on analytics1035 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode [16:42:47] RECOVERY - SSH on analytics1035 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [16:43:00] RECOVERY - Hadoop NodeManager on analytics1035 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:43:00] RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING [16:43:11] are you sure this box crashed? 
[16:43:19] RECOVERY - RAID on analytics1035 is OK: OK: optimal, 13 logical, 14 physical [16:43:20] RECOVERY - configured eth on analytics1035 is OK: OK - interfaces up [16:43:30] RECOVERY - dhclient process on analytics1035 is OK: PROCS OK: 0 processes with command name dhclient [16:43:35] ottomata: there was no output at all on mgmt console, and right after powercycling there was [16:43:39] hmmm, ok. [16:43:40] RECOVERY - DPKG on analytics1035 is OK: All packages OK [16:43:49] RECOVERY - puppet last run on restbase1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:43:51] RECOVERY - Disk space on Hadoop worker on analytics1035 is OK: DISK OK [16:44:00] PROBLEM - puppet last run on restbase2005 is CRITICAL: CRITICAL: puppet fail [16:44:01] RECOVERY - salt-minion processes on analytics1035 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:44:01] RECOVERY - puppet last run on analytics1035 is OK: OK: Puppet is currently enabled, last run 28 minutes ago with 0 failures [16:44:04] ok thanks, should be fine. 
[16:44:10] RECOVERY - Hadoop DataNode on analytics1035 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [16:44:18] ok [16:49:00] RECOVERY - puppet last run on restbase1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [16:51:40] RECOVERY - puppet last run on restbase1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:54:50] RECOVERY - puppet last run on restbase2001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [16:56:14] (03PS1) 10BryanDavis: logstash: Explicitly stringify array in kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/244198 [16:56:39] RECOVERY - puppet last run on restbase1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:56:49] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:57:39] RECOVERY - puppet last run on restbase2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:57:40] RECOVERY - puppet last run on restbase2005 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:58:01] RECOVERY - puppet last run on restbase1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:58:11] RECOVERY - puppet last run on restbase-test2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:58:20] RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [16:58:30] RECOVERY - puppet last run on restbase2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:00:04] moritzm: Dear anthropoid, the time has come. Please deploy Operations (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151007T1700). 
[17:01:14] interesting name [17:02:56] (03PS2) 10Muehlenhoff: archiva: move ferm rules into the role [puppet] - 10https://gerrit.wikimedia.org/r/244129 [17:02:57] we'll get deployed to PR next weekend, in some sense [17:03:40] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [17:03:41] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:04:25] godog: that has to go on the deployment calendar :P "Please deploy Operations to Puerto Rico" [17:04:44] (03CR) 10Muehlenhoff: [C: 032 V: 032] archiva: move ferm rules into the role [puppet] - 10https://gerrit.wikimedia.org/r/244129 (owner: 10Muehlenhoff) [17:05:12] JohnFLewis: hahah we should do that [17:05:34] (03CR) 10Mforns: [C: 031] logstash: Explicitly stringify array in kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/244198 (owner: 10BryanDavis) [17:12:41] PROBLEM - High load average on labstore1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0] [17:14:35] (03PS2) 10Andrew Bogott: swift: fix some alignement in rewrite.py [puppet] - 10https://gerrit.wikimedia.org/r/244147 (https://phabricator.wikimedia.org/T114887) (owner: 10Hashar) [17:16:47] (03CR) 10Andrew Bogott: [C: 032] swift: fix some alignement in rewrite.py [puppet] - 10https://gerrit.wikimedia.org/r/244147 (https://phabricator.wikimedia.org/T114887) (owner: 10Hashar) [17:21:37] 6operations, 6Release-Engineering-Team, 7Database, 5Patch-For-Review: Recover missing values from user_properties tables - https://phabricator.wikimedia.org/T114899#1709471 (10jcrespo) After inspecting the current state of the table, we believe that insert will not have any noticeable effect, due to changes... [17:28:41] is there any way yet for php code (operations/mediawiki-config specifically) to know which data center it's running in? [17:28:41] yes [17:28:41] * ebernhardson rephrases [17:28:42] how? 
[17:28:42] :P [17:28:42] $wmfDatacenter [17:28:42] see the README file [17:28:44] /ignore Krenair!* [17:28:44] :P [17:28:44] It mentions stuff about MWRealm which contains the setting for wmfDatacenter [17:28:44] :) [17:28:44] thanks [17:31:39] (03PS1) 10BBlack: X-Client-IP 1/12 - just move netmapper import + init [puppet] - 10https://gerrit.wikimedia.org/r/244201 (https://phabricator.wikimedia.org/T89177) [17:31:41] (03PS1) 10BBlack: X-Client-IP 2/12 - rename ip_proc sub, move req.restarts guard [puppet] - 10https://gerrit.wikimedia.org/r/244202 (https://phabricator.wikimedia.org/T89177) [17:31:43] (03PS1) 10BBlack: X-Client-IP 3/12 - remove fe default on be guard [puppet] - 10https://gerrit.wikimedia.org/r/244203 (https://phabricator.wikimedia.org/T89177) [17:31:45] (03PS1) 10BBlack: X-Client-IP 4/12 - move XFF-setter out of recv_fe_ip_processing [puppet] - 10https://gerrit.wikimedia.org/r/244204 (https://phabricator.wikimedia.org/T89177) [17:31:47] (03PS1) 10BBlack: X-Client-IP 5/12 - recv_fe_ip_proc frontend-only [puppet] - 10https://gerrit.wikimedia.org/r/244205 (https://phabricator.wikimedia.org/T89177) [17:31:49] (03PS1) 10BBlack: X-Client-IP 6/12 - unset the 4x new headers [puppet] - 10https://gerrit.wikimedia.org/r/244206 (https://phabricator.wikimedia.org/T89177) [17:31:51] (03PS1) 10BBlack: X-Client-IP 7/12 - Set X-T-P [puppet] - 10https://gerrit.wikimedia.org/r/244207 (https://phabricator.wikimedia.org/T89177) [17:31:53] (03PS1) 10BBlack: X-Client-IP 8/12 - Set X-CIP [puppet] - 10https://gerrit.wikimedia.org/r/244208 (https://phabricator.wikimedia.org/T89177) [17:31:55] (03PS1) 10BBlack: X-Client-IP 9/12 - Set X-C + X-C-M [puppet] - 10https://gerrit.wikimedia.org/r/244209 (https://phabricator.wikimedia.org/T89177) [17:31:57] (03PS1) 10BBlack: X-Client-IP 10/12 - switch zero.inc to using XC + XCM [puppet] - 10https://gerrit.wikimedia.org/r/244210 (https://phabricator.wikimedia.org/T89177) [17:31:59] (03PS1) 10BBlack: X-Client-IP 11/12 - remove outdated 
404-01b zero case [puppet] - 10https://gerrit.wikimedia.org/r/244211 (https://phabricator.wikimedia.org/T89177) [17:32:01] (03PS1) 10BBlack: X-Client-IP 12/12 - switch zero analytics to use XC/XCM [puppet] - 10https://gerrit.wikimedia.org/r/244212 (https://phabricator.wikimedia.org/T89177) [17:32:50] (03CR) 10Daniel Kinzler: "@PleaseStand Thanks for the information!" [puppet] - 10https://gerrit.wikimedia.org/r/242237 (owner: 10Daniel Kinzler) [17:36:31] (03PS2) 10Daniel Kinzler: Avoid breaking full phabricator URLs [puppet] - 10https://gerrit.wikimedia.org/r/242237 [17:36:31] 6operations, 6Release-Engineering-Team, 7Database, 5Patch-For-Review: Recover missing values from user_properties tables - https://phabricator.wikimedia.org/T114899#1709553 (10jcrespo) https://gerrit.wikimedia.org/r/#/c/244190/ was created for this task, and abandoned. Having a timestamp on user preferenc... [17:40:13] (03PS2) 10Dzahn: Revert "Revert "admin: Allow aklapper to reset user auths and delete accounts in Phab"" [puppet] - 10https://gerrit.wikimedia.org/r/244116 [17:40:39] (03CR) 10Jdlrobson: [C: 031] Enable banners on all namespaces on Russian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243837 (https://phabricator.wikimedia.org/T114566) (owner: 10Jdlrobson) [17:41:12] (03CR) 10Dzahn: [C: 032] "reverting the change that says " Re-revert any time!". this is partly right and needed. 
then a fix on top of that will follow to fix up th" [puppet] - 10https://gerrit.wikimedia.org/r/244116 (owner: 10Dzahn) [17:43:59] (03PS2) 10Ottomata: logstash: Explicitly stringify array in kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/244198 (owner: 10BryanDavis) [17:44:03] (03CR) 10Ottomata: [C: 032] logstash: Explicitly stringify array in kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/244198 (owner: 10BryanDavis) [17:44:10] (03CR) 10Ottomata: [V: 032] logstash: Explicitly stringify array in kafka input config [puppet] - 10https://gerrit.wikimedia.org/r/244198 (owner: 10BryanDavis) [17:45:20] RECOVERY - High load average on labstore1002 is OK: OK: Less than 50.00% above the threshold [16.0] [17:46:53] (03Abandoned) 10Alex Monk: Unbreak non-infile mode [debs/ircecho] - 10https://gerrit.wikimedia.org/r/236379 (owner: 10Alex Monk) [17:47:32] (03PS1) 10Dzahn: admin: fix sudo rules for phab admin, auth strip [puppet] - 10https://gerrit.wikimedia.org/r/244214 (https://phabricator.wikimedia.org/T113124) [17:47:59] (03PS3) 10Yuvipanda: labstore: Add a delete-dbuser script [puppet] - 10https://gerrit.wikimedia.org/r/244120 [17:48:46] (03PS2) 10Dzahn: admin: fix sudo rules for phab admin, auth strip [puppet] - 10https://gerrit.wikimedia.org/r/244214 (https://phabricator.wikimedia.org/T113124) [17:48:48] (03PS1) 10Andrew Bogott: Openstack: Added a custom keystone/policy.json [puppet] - 10https://gerrit.wikimedia.org/r/244215 (https://phabricator.wikimedia.org/T104588) [17:49:45] (03CR) 10Dzahn: [C: 032] "this is just about fixing the intended access that has been acked but didn't work like that" [puppet] - 10https://gerrit.wikimedia.org/r/244214 (https://phabricator.wikimedia.org/T113124) (owner: 10Dzahn) [17:50:22] (03PS2) 10Andrew Bogott: Openstack: Added a custom keystone/policy.json [puppet] - 10https://gerrit.wikimedia.org/r/244215 (https://phabricator.wikimedia.org/T104588) [17:51:43] (03CR) 10Andrew Bogott: [C: 032] Openstack: Added 
a custom keystone/policy.json [puppet] - 10https://gerrit.wikimedia.org/r/244215 (https://phabricator.wikimedia.org/T104588) (owner: 10Andrew Bogott) [17:52:37] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Expand shell access for aklapper on Phabricator - https://phabricator.wikimedia.org/T113124#1709623 (10Dzahn) @chasemp thanks for the detailed explanation! that helped. I talked with @aklapper and went with this option: ``` 'ALL = NOPASSWD: /srv/phab/... [17:55:35] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Expand shell access for aklapper on Phabricator - https://phabricator.wikimedia.org/T113124#1709630 (10Dzahn) @aklapper applied on iridium. is there a test user you could confirm this with and strip auth (then re-add it ?) [18:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151007T1800). [18:01:55] Looks like I need to take another poke at open mediawiki-config changes during the next swat [18:02:17] (03PS4) 10Yuvipanda: labstore: Add a delete-dbuser script [puppet] - 10https://gerrit.wikimedia.org/r/244120 [18:02:31] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Add a delete-dbuser script [puppet] - 10https://gerrit.wikimedia.org/r/244120 (owner: 10Yuvipanda) [18:03:00] (03PS1) 10EBernhardson: Define cirrussearch shards/replicas per datacenter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244221 [18:04:48] 6operations: Transfer of domain names to WMF servers - https://phabricator.wikimedia.org/T114922#1709675 (10VBaranetsky) 3NEW [18:05:03] (03PS2) 10Alex Monk: Document translation namespace best practices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244176 (owner: 10Dereckson) [18:05:38] (03PS2) 10Alex Monk: Namespace configuration on bn.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244177 (https://phabricator.wikimedia.org/T114623) (owner: 10Dereckson) [18:05:46] (03CR) 10BBlack: [C: 031] 
Specify SSHD listen address for lvs hosts [puppet] - 10https://gerrit.wikimedia.org/r/243982 (owner: 10Rush) [18:06:34] 6operations, 10Wikimedia-DNS: Transfer of domain names to WMF servers - https://phabricator.wikimedia.org/T114922#1709683 (10Krenair) [18:08:01] RECOVERY - Restbase root url on restbase-test2001 is OK: HTTP OK: HTTP/1.1 200 - 15118 bytes in 0.122 second response time [18:08:01] 6operations, 10Wikimedia-DNS: Transfer of domain names to WMF servers - https://phabricator.wikimedia.org/T114922#1709691 (10BBlack) p:5Triage>3Normal a:3Dzahn [18:08:20] RECOVERY - Restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [18:08:22] ehm, when i type /win 100 it might be time to close some old windows [18:08:43] 6operations, 10Wikimedia-DNS: Transfer of domain names to WMF servers - https://phabricator.wikimedia.org/T114922#1709675 (10BBlack) See also: T101048 - we'll need to remember these new ones as they're coming to us in the middle of that cleanup process... 
[18:09:38] 6operations, 10Wikimedia-DNS, 7domains: Transfer of domain names to WMF servers - https://phabricator.wikimedia.org/T114922#1709712 (10Dzahn) [18:13:00] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [18:13:37] (03PS1) 10Andrew Bogott: Openstack: Removed puppet header from policy.json [puppet] - 10https://gerrit.wikimedia.org/r/244225 [18:13:42] (03CR) 10jenkins-bot: [V: 04-1] Openstack: Removed puppet header from policy.json [puppet] - 10https://gerrit.wikimedia.org/r/244225 (owner: 10Andrew Bogott) [18:13:52] (03PS2) 10Andrew Bogott: Openstack: Removed puppet header from policy.json [puppet] - 10https://gerrit.wikimedia.org/r/244225 [18:15:19] (03CR) 10Andrew Bogott: [C: 032] Openstack: Removed puppet header from policy.json [puppet] - 10https://gerrit.wikimedia.org/r/244225 (owner: 10Andrew Bogott) [18:19:13] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for VBaranetsky - https://phabricator.wikimedia.org/T114308#1709747 (10JohnLewis) vbaranetsky is associated with the @wikimedia.org address. Seems connected to me. [18:19:18] (03PS1) 10John F. Lewis: admin: create account for vbaranetsky + add groups [puppet] - 10https://gerrit.wikimedia.org/r/244227 (https://phabricator.wikimedia.org/T114308) [18:19:34] (03PS2) 10John F. Lewis: admin: create account for vbaranetsky + add groups [puppet] - 10https://gerrit.wikimedia.org/r/244227 (https://phabricator.wikimedia.org/T114308) [18:24:44] (03Abandoned) 10Jforrester: Configure $wgRemoteUploadTarget [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244181 (owner: 10Jforrester) [18:24:58] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1002 for VBaranetsky - https://phabricator.wikimedia.org/T114308#1709772 (10JohnLewis) a:5VBaranetsky>3None [18:25:11] uhm, fatalmonitor isn't working? 
[18:25:49] (03CR) 10Hashar: "That fails puppet on labnodepool1001 :-( I ran it through the puppet compiler https://puppet-compiler.wmflabs.org/960/labnodepool1001.eqia" [puppet] - 10https://gerrit.wikimedia.org/r/244171 (https://phabricator.wikimedia.org/T113806) (owner: 10Hashar) [18:27:46] Undefined variable: wmgForeignUploadTargets in /srv/mediawiki/wmf-config/CommonSettings.php on line 2057 [18:29:57] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/244182/ [18:30:09] exists in labs only [18:30:18] (well, variable use) [18:30:39] (03PS1) 10Alex Monk: Rename mediawiki::web::sites to mediawiki::web::prod_sites to make room for a new generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/244228 [18:31:28] (03CR) 10Rush: "a note: We had a chat about having multiple cluster queues in the backlog and that this will reduce the time we can sustain backpressure d" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243744 (owner: 10EBernhardson) [18:31:57] (03PS1) 10Hashar: nodepool: use nrpe:: class for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/244229 (https://phabricator.wikimedia.org/T113806) [18:32:07] (03CR) 10Hashar: "Follow up: https://gerrit.wikimedia.org/r/244229" [puppet] - 10https://gerrit.wikimedia.org/r/244171 (https://phabricator.wikimedia.org/T113806) (owner: 10Hashar) [18:32:09] (03PS1) 10Andrew Bogott: Nodepool: Changed monitor_service to nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/244230 [18:32:17] hashar: I suppose you just did the same thing as ^ [18:32:51] (03CR) 10Andrew Bogott: [C: 032] nodepool: use nrpe:: class for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/244229 (https://phabricator.wikimedia.org/T113806) (owner: 10Hashar) [18:33:15] (03PS2) 10Alex Monk: Rename mediawiki::web::sites to mediawiki::web::prod_sites to make room for a new generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/244228 [18:33:38] (03Abandoned) 10Andrew Bogott: Nodepool: Changed monitor_service to 
nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/244230 (owner: 10Andrew Bogott) [18:34:25] andrewbogott: ah sorry [18:34:36] I merged yours and dropped mine — seems happy now [18:34:37] both classes confused me [18:34:45] I changed my mind in between and end up with a mixed change [18:35:20] RECOVERY - puppet last run on labnodepool1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:35:33] \O/ [18:36:20] greg-g: any objections against us doing a RB deploy between 12 & 1pm, rather than after 1pm? [18:37:46] JohnFLewis: that needs to use isset to avoid spamming logs [18:39:03] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1709819 (10Andrew) [18:39:14] twentyafterfour: could i ask you test a command on iridium [18:39:24] mutante: ok [18:39:32] what command? [18:39:35] gwicke: should be fine [18:39:42] (03PS1) 10John F. Lewis: mw_rc_irc: rename to standard naming scheme (underscores) [puppet] - 10https://gerrit.wikimedia.org/r/244236 [18:39:51] https://phabricator.wikimedia.org/rOMWCb15a5cba3d1a1a570dbc7ba8bb14578cda3f636f is blocking train deployment [18:39:52] (03PS1) 10Alex Monk: Begin to merge production and beta apache config, starting with nonexistent.conf [puppet] - 10https://gerrit.wikimedia.org/r/244237 [18:39:56] (03PS2) 10John F. Lewis: mw_rc_irc: rename to standard naming scheme (underscores) [puppet] - 10https://gerrit.wikimedia.org/r/244236 [18:40:08] greg-g: fyi ^^^ [18:40:12] twentyafterfour: sudo /srv/phab/phabricator/auth strip --all-types --user Malyacko [18:40:23] marktraceur: https://phabricator.wikimedia.org/rOMWCb15a5cba3d1a1a570dbc7ba8bb14578cda3f636f [18:40:29] greg-g: okay [18:40:34] twentyafterfour: it's andre's second user, he's ok with it, we want to confirm that the sudo rule works [18:41:17] twentyafterfour: he gets prompted for password but .. 
i dont really see why and you are in the same group [18:41:20] mutante: ok but I already have sudo access to all commands [18:41:35] twentyafterfour: oooh, you are phab-root vs. phab-admin [18:41:42] ehm.. yea [18:41:55] then i think thank you and never mind [18:42:55] ah [18:43:46] marktraceur: are you around? [18:46:01] (03CR) 10EBernhardson: [C: 04-2] "this is incorrect, we need to vary these in CirrusSearch and not in mediawiki-config, because one datacenter can talk to the other datacen" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244221 (owner: 10EBernhardson) [18:47:23] (03CR) 10John F. Lewis: "http://puppet-compiler.wmflabs.org/962/ shows a change but is the MOTD, which is expected." [puppet] - 10https://gerrit.wikimedia.org/r/244236 (owner: 10John F. Lewis) [18:49:06] twentyafterfour: [18:49:06] 18:48 < greg-g> heyo, there's this commit that is spamming production and holding up the train: https://phabricator.wikimedia.org/rOMWCb15a5cba3d1a1a570dbc7ba8bb14578cda3f636f [18:49:10] 18:48 < greg-g> can someone confirm if we can simply revert it? marktraceur isn't at the keyboard at the moment [18:49:14] 18:48 <+ MatmaRex> greg-g: yes, you can revert it [18:49:24] ebernhardson: https://reviews.facebook.net/D44973#820155 fyi [18:49:25] (from -multimedia, because, we love IRC channels) [18:49:40] (i guess it should suffice to change `if ( $wmgForeignUploadTargets )` to `if ( isset( $wmgForeignUploadTargets ) )` ?) [18:50:08] I prefer reverts not fixing forward, in these situations, where the committer is not around [18:50:09] twentyafterfour: one more question. 
if you see this list of commands on the right https://gerrit.wikimedia.org/r/#/c/244214/2/modules/admin/data/data.yaml what's a good test to run that is harmless and doesnt influence things [18:50:18] but whatever, we'll deploy an unfucked version later [18:50:23] :) [18:51:03] (03PS1) 10: Revert "Add $wgForeignUploadTargets" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244239 [18:51:36] twentyafterfour: ^ [18:53:41] mutante: /srv/phab/phabricator/bin/repository should be harmless [18:53:54] without any args it is anyway [18:54:14] twentyafterfour: thanks [18:54:23] greg-g: thanks [18:54:37] (03PS1) 10MarkTraceur: Use isset instead of just plain existence check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244241 [18:54:37] also, thanks MatmaRex [18:54:38] greg-g: ^^ [18:55:02] marktraceur: awesome +2 [18:55:06] (03CR) 1020after4: [C: 032] Use isset instead of just plain existence check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244241 (owner: 10MarkTraceur) [18:55:09] Thanks guys! [18:55:12] (03Merged) 10jenkins-bot: Use isset instead of just plain existence check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244241 (owner: 10MarkTraceur) [18:55:46] (03Abandoned) 10: Revert "Add $wgForeignUploadTargets" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244239 [18:56:14] (03PS1) 10John F. Lewis: wdq_mm: rename module to standard naming (underscores) [puppet] - 10https://gerrit.wikimedia.org/r/244242 [18:56:28] !log twentyafterfour@tin Synchronized wmf-config/CommonSettings.php: fix undefined variable warning that has been spamming logs (duration: 00m 17s) [18:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:58:33] twentyafterfour: JohnFLewis: duh.. all commands are in a ./bin/ except one.. that's the issue :p [18:58:57] doh [18:58:58] really? haha [19:00:37] (03CR) 10John F. Lewis: "This is used in labs only. Per that, adding Yuvi as he is a project admin for the project. 
(03PS1) 10Dzahn: admin: fix phab admin sudo rules pt.2 [puppet] - 10https://gerrit.wikimedia.org/r/244243 (https://phabricator.wikimedia.org/T113124) [19:01:04] (03CR) 10jenkins-bot: [V: 04-1] admin: fix phab admin sudo rules pt.2 [puppet] - 10https://gerrit.wikimedia.org/r/244243 (https://phabricator.wikimedia.org/T113124) (owner: 10Dzahn) [19:01:11] (03PS2) 10Dzahn: admin: fix phab admin sudo rules pt.2 [puppet] - 10https://gerrit.wikimedia.org/r/244243 (https://phabricator.wikimedia.org/T113124) [19:01:29] yuvipanda: can you add https://gerrit.wikimedia.org/r/#/c/244242/ onto your review list/workflow when you can? thanks! :) [19:02:02] (03CR) 10John F. Lewis: [C: 031] admin: fix phab admin sudo rules pt.2 [puppet] - 10https://gerrit.wikimedia.org/r/244243 (https://phabricator.wikimedia.org/T113124) (owner: 10Dzahn) [19:02:25] (03CR) 10Dzahn: [C: 032] admin: fix phab admin sudo rules pt.2 [puppet] - 10https://gerrit.wikimedia.org/r/244243 (https://phabricator.wikimedia.org/T113124) (owner: 10Dzahn) [19:03:27] andre__: ^ try again now [19:04:41] doin' [19:04:56] twentyafterfour: Usage Exception but it works nevertheless, heh [19:07:33] (03PS1) 1020after4: group1 wikis to 1.27.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244245 [19:08:25] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Expand shell access for aklapper on Phabricator - https://phabricator.wikimedia.org/T113124#1709873 (10Dzahn) confirmed it works now for andre after the fix above [19:08:31] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Expand shell access for aklapper on Phabricator - https://phabricator.wikimedia.org/T113124#1709874 (10Dzahn) 5Open>3Resolved [19:09:06] (03CR) 1020after4: [C: 032] group1 wikis to 1.27.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244245 (owner: 1020after4) [19:09:13] (03Merged) 10jenkins-bot: group1 wikis to 1.27.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244245 (owner: 1020after4) [19:09:37] !log 
twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.2 [19:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:10:48] JohnFLewis: thanks :) I'll merge it now [19:11:41] (03PS2) 10Yuvipanda: wdq_mm: rename module to standard naming (underscores) [puppet] - 10https://gerrit.wikimedia.org/r/244242 (owner: 10John F. Lewis) [19:12:40] (03CR) 10Yuvipanda: [C: 032] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/244242 (owner: 10John F. Lewis) [19:14:08] !log deploying c20e6336 to canary node restbase1001.eqiad [19:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:17:45] (03PS5) 10Rush: Specify SSHD listen address for lvs hosts [puppet] - 10https://gerrit.wikimedia.org/r/243982 [19:17:47] (03PS1) 10Ottomata: Add statsd param to hadoop jmxtrans classes [puppet/cdh] - 10https://gerrit.wikimedia.org/r/244248 (https://phabricator.wikimedia.org/T90642) [19:17:56] (03CR) 10Rush: [C: 032 V: 032] Specify SSHD listen address for lvs hosts [puppet] - 10https://gerrit.wikimedia.org/r/243982 (owner: 10Rush) [19:18:42] 7Blocked-on-Operations, 6operations, 6Phabricator, 6Release-Engineering-Team, and 2 others: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1709909 (10chasemp) [19:19:11] (03CR) 10Dzahn: [C: 031] "confirmed UID. cn: Vbaranetsky. mail: vbaranetsky@wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/244227 (https://phabricator.wikimedia.org/T114308) (owner: 10John F. Lewis) [19:19:15] heh topic is wrong [19:19:33] robh: ^ [19:19:41] (03PS2) 10Ottomata: Add statsd param to hadoop jmxtrans classes [puppet/cdh] - 10https://gerrit.wikimedia.org/r/244248 (https://phabricator.wikimedia.org/T90642) [19:19:43] ? 
[19:19:49] it looks as if you are on duty [19:19:57] well, whoever replaces me is supposed to change that [19:19:58] * JohnFLewis thought Chris was [19:20:02] cmjohnson1: ^ [19:20:08] you forgot to update your name into the topic [19:20:22] (03CR) 10Ottomata: [C: 032] Add statsd param to hadoop jmxtrans classes [puppet/cdh] - 10https://gerrit.wikimedia.org/r/244248 (https://phabricator.wikimedia.org/T90642) (owner: 10Ottomata) [19:20:35] no biggie =] [19:21:37] puppet run warnings on lvs* hosts atm would be me, I'm slowly rolling out a change [19:21:43] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [19:21:45] (03PS1) 10Ottomata: Configure hadoop master and worker to send jmx stats to statsd [puppet] - 10https://gerrit.wikimedia.org/r/244249 (https://phabricator.wikimedia.org/T90642) [19:21:50] (03CR) 10jenkins-bot: [V: 04-1] Configure hadoop master and worker to send jmx stats to statsd [puppet] - 10https://gerrit.wikimedia.org/r/244249 (https://phabricator.wikimedia.org/T90642) (owner: 10Ottomata) [19:21:54] PROBLEM - Restbase root url on restbase1001 is CRITICAL: Connection refused [19:23:00] godog: ^? 
[19:23:09] chasemp: we are deploying [19:23:27] and this deploy involves the creation of new tables [19:23:32] 10Ops-Access-Requests, 6operations: Expand shell access for aklapper on Phabricator - https://phabricator.wikimedia.org/T113124#1709925 (10Dzahn) [19:23:37] so startup is slower than usual [19:23:52] (03PS2) 10Ottomata: Configure hadoop master and worker to send jmx stats to statsd [puppet] - 10https://gerrit.wikimedia.org/r/244249 (https://phabricator.wikimedia.org/T90642) [19:24:54] (03CR) 10Ottomata: [C: 032] Configure hadoop master and worker to send jmx stats to statsd [puppet] - 10https://gerrit.wikimedia.org/r/244249 (https://phabricator.wikimedia.org/T90642) (owner: 10Ottomata) [19:26:01] k [19:26:53] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy [19:27:03] RECOVERY - Restbase root url on restbase1001 is OK: HTTP OK: HTTP/1.1 200 - 15118 bytes in 0.008 second response time [19:29:19] (03CR) 10Dzahn: "nevermind my comment about the LDAP group. it's irrelevant here and not requested. the rest should be ok, but double check groups and key" [puppet] - 10https://gerrit.wikimedia.org/r/244227 (https://phabricator.wikimedia.org/T114308) (owner: 10John F. Lewis) [19:29:39] !log canary deploy to restbase1001.eqiad complete [19:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:30:49] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1002 for VBaranetsky - https://phabricator.wikimedia.org/T114308#1709939 (10Dzahn) a:3Cmjohnson @Cmjohnson can you follow-up here as part of the on-duty week? i confirmed the UID matches the labs user and it has a @wikimedia.or... 
[19:37:48] (03PS2) 10Dzahn: logstash: access to port 9200 for krypton [puppet] - 10https://gerrit.wikimedia.org/r/244095 (https://phabricator.wikimedia.org/T114836) [19:41:28] !log doing full deploy of c20e6336 to RESTBase [19:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:43:25] (03PS2) 10BBlack: X-Client-IP 1/12 - just move netmapper import + init [puppet] - 10https://gerrit.wikimedia.org/r/244201 (https://phabricator.wikimedia.org/T89177) [19:43:27] (03PS3) 10Cmjohnson: admin: create account for vbaranetsky + add groups [puppet] - 10https://gerrit.wikimedia.org/r/244227 (https://phabricator.wikimedia.org/T114308) (owner: 10John F. Lewis) [19:43:33] (03CR) 10BBlack: [C: 032 V: 032] X-Client-IP 1/12 - just move netmapper import + init [puppet] - 10https://gerrit.wikimedia.org/r/244201 (https://phabricator.wikimedia.org/T89177) (owner: 10BBlack) [19:45:05] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1002 for VBaranetsky - https://phabricator.wikimedia.org/T114308#1709990 (10Cmjohnson) I also confirmed the UID matches, the user now has a wikitech account as per Rob's suggestion. It has been more than 3 days and merging the pa... [19:45:52] (03PS4) 10Cmjohnson: admin: create account for vbaranetsky + add groups [puppet] - 10https://gerrit.wikimedia.org/r/244227 (https://phabricator.wikimedia.org/T114308) (owner: 10John F. Lewis) [19:48:30] (03CR) 10Cmjohnson: [C: 032] admin: create account for vbaranetsky + add groups [puppet] - 10https://gerrit.wikimedia.org/r/244227 (https://phabricator.wikimedia.org/T114308) (owner: 10John F. 
Lewis) [19:49:45] (03CR) 10Dzahn: "i ran "@logstash1001:~# tcpdump dst host logstash1001.eqiad.wmnet and dst port 9200" and confirmed traffic from krypton and neon, which is" [puppet] - 10https://gerrit.wikimedia.org/r/244095 (https://phabricator.wikimedia.org/T114836) (owner: 10Dzahn) [19:50:10] (03PS3) 10Dzahn: logstash: access to port 9200 for krypton [puppet] - 10https://gerrit.wikimedia.org/r/244095 (https://phabricator.wikimedia.org/T114836) [19:50:29] !log RESTBase deploy complete [19:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:51:19] (03CR) 10Dzahn: [C: 032] logstash: access to port 9200 for krypton [puppet] - 10https://gerrit.wikimedia.org/r/244095 (https://phabricator.wikimedia.org/T114836) (owner: 10Dzahn) [19:51:41] 6operations, 10Salt: salt still has issues with grain selection? - https://phabricator.wikimedia.org/T114937#1710015 (10BBlack) 3NEW [20:00:04] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151007T2000). [20:00:14] no deploy today. 
[20:04:16] (03PS1) 10Ori.livneh: Make the 'sessions' redis cache multi-DC ready [puppet] - 10https://gerrit.wikimedia.org/r/244325 (https://phabricator.wikimedia.org/T111575) [20:07:07] (03PS2) 10Ori.livneh: Make the 'sessions' redis cache multi-DC ready [puppet] - 10https://gerrit.wikimedia.org/r/244325 (https://phabricator.wikimedia.org/T111575) [20:07:18] (03CR) 10Ori.livneh: [C: 032 V: 032] Make the 'sessions' redis cache multi-DC ready [puppet] - 10https://gerrit.wikimedia.org/r/244325 (https://phabricator.wikimedia.org/T111575) (owner: 10Ori.livneh) [20:12:46] (03PS1) 10Ori.livneh: Switch mw1017 to use DC-specific redis cluster names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244330 [20:13:38] (03CR) 10Ori.livneh: [C: 032] Switch mw1017 to use DC-specific redis cluster names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244330 (owner: 10Ori.livneh) [20:13:43] (03Merged) 10jenkins-bot: Switch mw1017 to use DC-specific redis cluster names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244330 (owner: 10Ori.livneh) [20:14:34] !log ori@tin Synchronized wmf-config/session.php: Ie25c368a: Switch mw1017 to use DC-specific redis cluster names (duration: 00m 17s) [20:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:15:29] (03PS4) 10Dzahn: mw-rc-irc: firewall hole for RC IRC bot [puppet] - 10https://gerrit.wikimedia.org/r/244068 (https://phabricator.wikimedia.org/T104943) [20:17:22] (03CR) 10Dzahn: [C: 032] mw-rc-irc: firewall hole for RC IRC bot [puppet] - 10https://gerrit.wikimedia.org/r/244068 (https://phabricator.wikimedia.org/T104943) (owner: 10Dzahn) [20:18:09] (03PS1) 10MarkTraceur: Fix labs settings for foreign uploads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244332 [20:18:29] MatmaRex: ^ let's make sure I did it right this time [20:19:03] (03CR) 10Bartosz Dziewoński: [C: 031] Fix labs settings for foreign uploads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244332 
(owner: 10MarkTraceur) [20:20:00] Cool beans. [20:20:19] Is it a bad time to merge a labs-only config patch? Anyone mind if I sync it? [20:20:58] ostriches: any chance you remember how to trigger a reindex in elasticsearch? me and stas are puzzling over it [20:20:59] (03PS1) 10Andrew Bogott: Openstack: set OS_IDENTITY_API_VERSION=3 [puppet] - 10https://gerrit.wikimedia.org/r/244333 [20:21:17] s/elasticsearch/cirrussearch/ [20:21:37] ebernhardson: for a single page? [20:21:43] Or the whole thing? [20:22:06] ostriches: the whole thing. Basically we are adjusting the reindexer to be able to do a copy between clusters. But we want to make sure we don't break what it already does [20:22:13] problem is, we can't figure out how to trigger whatever it already does [20:22:54] PROBLEM - puppet last run on db2023 is CRITICAL: CRITICAL: puppet fail [20:22:58] forceSearchIndex or whatever the script is called should overwrite existing entries with the right options [20:24:18] ostriches: that looks to build directly from the wiki databases, rather than using the reindexer (sucks es documents out of one index, and the sticks them into a newly created index) [20:24:26] so it might be, the reindexer was not used much :) [20:24:26] (03PS2) 10BBlack: X-Client-IP 2/12 - rename ip_proc sub, move req.restarts guard [puppet] - 10https://gerrit.wikimedia.org/r/244202 (https://phabricator.wikimedia.org/T89177) [20:24:35] (03CR) 10BBlack: [C: 032 V: 032] X-Client-IP 2/12 - rename ip_proc sub, move req.restarts guard [puppet] - 10https://gerrit.wikimedia.org/r/244202 (https://phabricator.wikimedia.org/T89177) (owner: 10BBlack) [20:24:40] !log resetting drac on argon [20:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:25:18] ebernhardson: Ah, reindex based on existing entries? updateSearchIndex with the right options. 
[20:25:51] !log argon: installing package upgrades [20:25:51] ostriches: yea we've been playing with that and having no luck :( [20:25:51] hrm [20:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:26:04] ostriches: i thought perhaps changing the shard count would do it, since that requires a full index, but it just says "sorry, cant do that" [20:26:19] Hrm [20:26:23] oh well, we'll figure it out was just hoping you remembered :) [20:32:45] 6operations, 10Analytics, 6Discovery, 10EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1710260 (10GWicke) [20:35:19] (03PS2) 10Andrew Bogott: Openstack: set OS_IDENTITY_API_VERSION=3 [puppet] - 10https://gerrit.wikimedia.org/r/244333 [20:35:39] (03PS3) 10Andrew Bogott: Openstack: set OS_IDENTITY_API_VERSION=3 [puppet] - 10https://gerrit.wikimedia.org/r/244333 [20:36:49] (03CR) 10Andrew Bogott: [C: 032] Openstack: set OS_IDENTITY_API_VERSION=3 [puppet] - 10https://gerrit.wikimedia.org/r/244333 (owner: 10Andrew Bogott) [20:37:40] 6operations, 10Analytics, 6Discovery, 10EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1710269 (10Spage) a:3Ottomata [20:39:16] (03CR) 10Yuvipanda: [C: 031] "Thanks :) Do test!" 
[software/tools-manifest] - 10https://gerrit.wikimedia.org/r/239377 (https://phabricator.wikimedia.org/T109362) (owner: 10coren) [20:40:13] (03PS2) 10BBlack: X-Client-IP 3/12 - remove fe default on be guard [puppet] - 10https://gerrit.wikimedia.org/r/244203 (https://phabricator.wikimedia.org/T89177) [20:40:27] (03CR) 10BBlack: [C: 032 V: 032] X-Client-IP 3/12 - remove fe default on be guard [puppet] - 10https://gerrit.wikimedia.org/r/244203 (https://phabricator.wikimedia.org/T89177) (owner: 10BBlack) [20:41:14] 6operations, 10ops-codfw: audit for juniper switch QFX5100-48S-AFI - https://phabricator.wikimedia.org/T114952#1710303 (10RobH) [20:41:21] (03PS2) 10BBlack: X-Client-IP 4/12 - move XFF-setter out of recv_fe_ip_processing [puppet] - 10https://gerrit.wikimedia.org/r/244204 (https://phabricator.wikimedia.org/T89177) [20:41:27] (03CR) 10BBlack: [C: 032 V: 032] X-Client-IP 4/12 - move XFF-setter out of recv_fe_ip_processing [puppet] - 10https://gerrit.wikimedia.org/r/244204 (https://phabricator.wikimedia.org/T89177) (owner: 10BBlack) [20:50:34] RECOVERY - puppet last run on db2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:51:13] (03PS1) 10Dzahn: mw-rc-irc: firewall rule for Apache [puppet] - 10https://gerrit.wikimedia.org/r/244343 (https://phabricator.wikimedia.org/T104943) [20:51:27] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/244343/" [puppet] - 10https://gerrit.wikimedia.org/r/223887 (https://phabricator.wikimedia.org/T104943) (owner: 10Dzahn) [20:52:06] (03CR) 10Dzahn: [C: 032] mw-rc-irc: firewall rule for Apache [puppet] - 10https://gerrit.wikimedia.org/r/244343 (https://phabricator.wikimedia.org/T104943) (owner: 10Dzahn) [20:52:56] (03PS3) 10Dzahn: argon: add base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/223887 (https://phabricator.wikimedia.org/T104943) [20:53:40] (03CR) 10Dzahn: [C: 032] argon: add base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/223887 
(https://phabricator.wikimedia.org/T104943) (owner: 10Dzahn) [20:56:56] !log applied firewalling on IRCd server, rc bot still working fine, all public IRC ports as before [20:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:57:04] Krenair: ^ [20:57:58] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1002 for VBaranetsky - https://phabricator.wikimedia.org/T114308#1710355 (10Cmjohnson) 5Open>3Resolved Resolving this task. @VBaranetsky Please let us know if you experience any problems [21:01:54] mutante: if you're doing argon work; mind looking at the naming standardisation patch I submitted? [21:03:10] JohnFLewis: sorry, no, actually i did everything to not also do that motd change :p [21:03:17] that requires service restart [21:03:21] (03PS1) 10Ori.livneh: Revert "Don't use nutcracker on wikitech" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244347 (https://phabricator.wikimedia.org/T102993) [21:03:23] which requires announcing it [21:03:32] users need to reconnect their clients etc [21:03:36] mutante: different motd :) [21:03:59] ah..fair [21:04:02] its the main server one and really doesn't need a restart honestly just to chain - to _'s :) [21:04:09] but that's another ticket somewhere [21:04:10] *reboot not restart [21:04:15] the other motd [21:04:30] ok [21:05:05] andrewbogott: I'm going to merge https://gerrit.wikimedia.org/r/#/c/244347/ -- the original change is rather ill-conceived [21:05:22] it also does mean puppet has 100% consistent module and manifests naming for module/roles so plus there :) [21:06:05] PROBLEM - puppet last run on db2019 is CRITICAL: CRITICAL: puppet fail [21:06:11] ori: that’s fine, as long as I can bug you when it dies on silver :) [21:07:14] I don't see how you "reduce complexity" by adding a configuration file "same-settings-as-everywhere-else-except-this-one-thing.php", which you then include if ( 
$wgImNotBranchingOnHostnameBecauseLookItsAConfigurationVariable ) { } [21:09:14] JohnFLewis: it looks good.. except it doesnt merge like this [21:09:30] hm? [21:09:34] andrewbogott: I'm always happy to help [21:09:42] JohnFLewis: click rebase button and see [21:09:52] (03PS2) 10Ori.livneh: Revert "Don't use nutcracker on wikitech" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244347 (https://phabricator.wikimedia.org/T102993) [21:10:04] (03CR) 10Ori.livneh: [C: 032] Revert "Don't use nutcracker on wikitech" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244347 (https://phabricator.wikimedia.org/T102993) (owner: 10Ori.livneh) [21:10:08] mutante: bah your merge :( [21:10:12] (03Merged) 10jenkins-bot: Revert "Don't use nutcracker on wikitech" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244347 (https://phabricator.wikimedia.org/T102993) (owner: 10Ori.livneh) [21:13:16] (03CR) 10Billinghurst: [C: 031] "Codification of previous adhoc implementations. Nice." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244176 (owner: 10Dereckson) [21:14:18] 6operations, 7user-notice: schedule maintenance for IRC server - https://phabricator.wikimedia.org/T105804#1710419 (10Dzahn) [21:14:48] (03PS3) 10John F. Lewis: mw_rc_irc: rename to standard naming scheme (underscores) [puppet] - 10https://gerrit.wikimedia.org/r/244236 [21:18:26] 6operations, 7user-notice: schedule maintenance for IRC server - https://phabricator.wikimedia.org/T105804#1710435 (10Dzahn) I resolved the firewalling part of this without needing a maintenance downtime. I give it back to the pool/ up for grabs about the motd change and possibly IPv6. 
[21:19:35] 6operations, 7user-notice: schedule maintenance for IRC server - https://phabricator.wikimedia.org/T105804#1710440 (10Dzahn) [21:19:46] 6operations, 7user-notice: schedule maintenance for IRC server - https://phabricator.wikimedia.org/T105804#1710441 (10Dzahn) a:5Dzahn>3None [21:19:55] (03PS1) 10Andrew Bogott: Openstack: Don't notify keystone when the keystone policy changes [puppet] - 10https://gerrit.wikimedia.org/r/244349 [21:19:57] (03PS1) 10Andrew Bogott: Keystone: Adopt a multi-domain model [puppet] - 10https://gerrit.wikimedia.org/r/244350 [21:20:41] (03CR) 10jenkins-bot: [V: 04-1] Keystone: Adopt a multi-domain model [puppet] - 10https://gerrit.wikimedia.org/r/244350 (owner: 10Andrew Bogott) [21:21:42] (03PS2) 10Andrew Bogott: Keystone: Adopt a multi-domain model [puppet] - 10https://gerrit.wikimedia.org/r/244350 [21:22:34] (03CR) 10jenkins-bot: [V: 04-1] Keystone: Adopt a multi-domain model [puppet] - 10https://gerrit.wikimedia.org/r/244350 (owner: 10Andrew Bogott) [21:23:03] 6operations: schedule maintenance for IRC server - https://phabricator.wikimedia.org/T105804#1710449 (10Dzahn) [21:23:14] 6operations: schedule maintenance for IRC server - https://phabricator.wikimedia.org/T105804#1710450 (10Dzahn) p:5Normal>3Low [21:24:01] (03CR) 10Dzahn: "rebuilt as http://puppet-compiler.wmflabs.org/964/argon.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/244236 (owner: 10John F. Lewis) [21:24:10] (03CR) 10Dzahn: [C: 032] mw_rc_irc: rename to standard naming scheme (underscores) [puppet] - 10https://gerrit.wikimedia.org/r/244236 (owner: 10John F. Lewis) [21:24:36] JohnFLewis: arrr. or not [21:24:46] (03PS3) 10Andrew Bogott: Keystone: Adopt a multi-domain model [puppet] - 10https://gerrit.wikimedia.org/r/244350 [21:25:00] JohnFLewis: cant merge on master.. 
ehm [21:25:19] !log ori@tin Synchronized wmf-config/CommonSettings.php: Revert "Don't use nutcracker on wikitech" (duration: 00m 16s) [21:25:21] (03CR) 10jenkins-bot: [V: 04-1] Keystone: Adopt a multi-domain model [puppet] - 10https://gerrit.wikimedia.org/r/244350 (owner: 10Andrew Bogott) [21:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:25:52] (03PS1) 10Dzahn: Revert "mw_rc_irc: rename to standard naming scheme (underscores)" [puppet] - 10https://gerrit.wikimedia.org/r/244353 [21:28:55] (03CR) 10Dzahn: [C: 032] Revert "mw_rc_irc: rename to standard naming scheme (underscores)" [puppet] - 10https://gerrit.wikimedia.org/r/244353 (owner: 10Dzahn) [21:28:59] mutante: why the revert? [21:29:34] chasemp: it changed unrelated submodule [21:29:44] ah [21:29:54] i did the revert before merging on master [21:30:00] so that turned into nothing [21:32:04] (03CR) 10Dzahn: "there seems to be a general disagreement about base::firewall on nodes vs. roles that applies to this just like the other changes doing th" [puppet] - 10https://gerrit.wikimedia.org/r/242180 (owner: 10Muehlenhoff) [21:33:26] (03Abandoned) 10Dzahn: irc.wikimedia.org: Add ferm rule for Apache [puppet] - 10https://gerrit.wikimedia.org/r/244122 (https://phabricator.wikimedia.org/T104943) (owner: 10Muehlenhoff) [21:34:32] (03PS4) 10Andrew Bogott: Keystone: Adopt a multi-domain model [puppet] - 10https://gerrit.wikimedia.org/r/244350 [21:34:55] RECOVERY - puppet last run on db2019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:34:56] 6operations, 10Deployment-Systems, 6Release-Engineering-Team, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1710501 (10thcipriani) Following up on subpoints, point-by-point, expanding on T109535#1691326 > - rolling deploys / config changes... 
[21:35:48] https://github.com/wikimedia/operations-puppet/commit/b3bb5614ff0c29bfa11fc66bb835455b8851ffcb [21:36:24] mutante: ^ see submodule change there. Likely where mine came from [21:37:17] Yeah, my bad revert it seems. I'll rematch tomorrow :) [21:37:35] JohnFLewis: ok, thank you [21:37:50] a clean revert just seemed safer than anything manual [21:38:06] (03PS2) 10Jforrester: VisualEditor: Enabled for logged-out users on the English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242042 (https://phabricator.wikimedia.org/T90662) [21:39:57] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [21:40:27] mutante: ^ you by any chance? :) [21:40:48] JohnFLewis: aaarrrr [21:40:55] No changes to merge. [21:41:06] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [21:42:14] should recover on palladium [21:43:07] icinga-wm: come on [21:43:26] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [21:44:27] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [21:44:30] 6operations, 6Phabricator, 7audits-data-retention: Enable mod_remoteip and ensure logs follow retention guidelines - https://phabricator.wikimedia.org/T114014#1710565 (10chasemp) [21:44:45] and this happens if you merge a change and an exact revert of that change [21:44:54] puppet-merge thinks there is nothing to do [21:45:07] so you have to git pull origin yourself [21:46:40] 6operations, 10Wikimedia-Mailing-lists: import old staff list archives ? - https://phabricator.wikimedia.org/T109395#1710588 (10Dzahn) Ideas for the best way to "un-stall" this? 
[21:47:06] JohnFLewis: https://phabricator.wikimedia.org/T114861 :p [21:47:24] thats still the bug mailing lists i think [21:47:40] lists who want every single phabricator action to be an email to list [21:47:51] and then do mass actions i suppose [21:48:32] i guess i want to question if those bug lists make sense in the first place [21:48:37] (03PS1) 10Ori.livneh: labs: define mediawiki::redis_servers::{eqiad,codfw} [puppet] - 10https://gerrit.wikimedia.org/r/244359 [21:48:40] it's back from Bugzilla... [21:48:48] (03PS2) 10Ori.livneh: labs: define mediawiki::redis_servers::{eqiad,codfw} [puppet] - 10https://gerrit.wikimedia.org/r/244359 [21:48:54] (03CR) 10Ori.livneh: [C: 032 V: 032] labs: define mediawiki::redis_servers::{eqiad,codfw} [puppet] - 10https://gerrit.wikimedia.org/r/244359 (owner: 10Ori.livneh) [21:49:12] 7Blocked-on-Operations, 6operations, 6Phabricator, 6Release-Engineering-Team, and 2 others: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1710613 (10chasemp) Our ferm module doesn't seem to allow specification of the dst address at this moment. In this case we will have multi... [21:51:05] mutante, did you check if digests work in the new mailman? [21:51:43] we found some problems there, the digest script failed where there were bad characters [21:55:09] Platonides: I'm receiving digests so yes [21:57:01] I am too [21:57:07] Platonides: bad characters in mails? the encoding issues i know of were just in listinfo templates and description field [21:57:31] never heard about issues with mail content [21:57:55] 'bad characters in mails?' 
-> Yes, everyone on the internet is an asshole now and then [21:57:57] * yuvipanda slinks away [21:58:45] 6operations, 6Phabricator, 7audits-data-retention: Enable mod_remoteip and ensure logs follow retention guidelines - https://phabricator.wikimedia.org/T114014#1710664 (10BBlack) We'll soon have X-Client-IP available pre-decoded, so you don't have to mess with XFF and network/cache lists, etc. I think you ca... [21:58:47] (03PS1) 10Ori.livneh: Make the redis cache configuration multi-DC-ready [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244361 (https://phabricator.wikimedia.org/T111575) [21:58:47] PROBLEM - puppet last run on es2009 is CRITICAL: CRITICAL: puppet fail [21:59:13] (03PS2) 10Ori.livneh: Make the redis cache configuration multi-DC-ready [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244361 (https://phabricator.wikimedia.org/T111575) [21:59:32] 6operations: mailman check_queue recurrent alarm/recovery - https://phabricator.wikimedia.org/T114861#1710667 (10Dzahn) @JohnLewis i agree, but i also wonder if we had "per queue"-levels, would we still monitor this queue and would it be useful with a threshold that is so high? i mean.. as opposed to just monito... [22:04:53] (03CR) 10Dpatrick: [C: 031] Don't open up the Kafka JMX port for debugging [puppet] - 10https://gerrit.wikimedia.org/r/244187 (owner: 10Muehlenhoff) [22:06:09] (03CR) 10Dzahn: [C: 031] Don't open up the Kafka JMX port for debugging [puppet] - 10https://gerrit.wikimedia.org/r/244187 (owner: 10Muehlenhoff) [22:07:05] (03CR) 10Ori.livneh: [C: 031] "Still LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/241643 (owner: 10BBlack) [22:16:27] (03CR) 10Alex Monk: [WIP] Labs DNS: Stop hardcoding instance IPs in Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/243357 (owner: 10Alex Monk) [22:21:20] (03PS1) 10Andrew Bogott: Revert "Openstack: set OS_IDENTITY_API_VERSION=3" [puppet] - 10https://gerrit.wikimedia.org/r/244365 [22:22:37] 6operations, 6Release-Engineering-Team, 7Database, 5Patch-For-Review: Recover missing values from user_properties tables - https://phabricator.wikimedia.org/T114899#1710873 (10demon) p:5Triage>3Low [22:23:11] (03PS1) 10Dzahn: mailman: queue monitoring, enable multi thresholds [puppet] - 10https://gerrit.wikimedia.org/r/244366 (https://phabricator.wikimedia.org/T114861) [22:24:03] (03PS2) 10BBlack: X-Client-IP 5/12 - recv_fe_ip_proc frontend-only [puppet] - 10https://gerrit.wikimedia.org/r/244205 (https://phabricator.wikimedia.org/T89177) [22:24:40] (03CR) 10BBlack: [C: 032 V: 032] X-Client-IP 5/12 - recv_fe_ip_proc frontend-only [puppet] - 10https://gerrit.wikimedia.org/r/244205 (https://phabricator.wikimedia.org/T89177) (owner: 10BBlack) [22:25:46] RECOVERY - puppet last run on es2009 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [22:27:21] (03PS1) 10Dzahn: add typo domains to parking [dns] - 10https://gerrit.wikimedia.org/r/244367 (https://phabricator.wikimedia.org/T114922) [22:29:11] 6operations, 10Traffic, 6WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#1710932 (10Dzahn) @BBlack it looks we have 3 google docs, one each of us made. 
which one should we declare the "master" ?:) [22:31:31] (03PS2) 10Andrew Bogott: Revert "Openstack: set OS_IDENTITY_API_VERSION=3" [puppet] - 10https://gerrit.wikimedia.org/r/244365 [22:32:41] (03CR) 10Andrew Bogott: [C: 032] Revert "Openstack: set OS_IDENTITY_API_VERSION=3" [puppet] - 10https://gerrit.wikimedia.org/r/244365 (owner: 10Andrew Bogott) [22:35:45] 6operations, 10Wikimedia-DNS, 5Patch-For-Review, 7domains: Transfer of domain names to WMF servers - https://phabricator.wikimedia.org/T114922#1710961 (10Dzahn) Hi @VBaranetsky, I could confirm wikidpedia.org and wikipedial.org are owned by WMF and already point to our DNS servers. But wilkipedia.org is... [22:46:32] (03PS1) 10Andrew Bogott: Revert to a v2-friendlier version of the keystone policy.conf [puppet] - 10https://gerrit.wikimedia.org/r/244368 [22:48:15] (03PS2) 10Andrew Bogott: Revert to a v2-friendlier version of the keystone policy.conf [puppet] - 10https://gerrit.wikimedia.org/r/244368 [22:51:09] (03CR) 10Andrew Bogott: [C: 032] Revert to a v2-friendlier version of the keystone policy.conf [puppet] - 10https://gerrit.wikimedia.org/r/244368 (owner: 10Andrew Bogott) [22:51:29] (03CR) 10QChris: "> @QChris Ugh. Is the reliance on $1 documented somewhere?" [puppet] - 10https://gerrit.wikimedia.org/r/242237 (owner: 10Daniel Kinzler) [22:52:00] (03PS2) 10BBlack: X-Client-IP 6/12 - unset the 4x new headers [puppet] - 10https://gerrit.wikimedia.org/r/244206 (https://phabricator.wikimedia.org/T89177) [22:52:07] (03CR) 10BBlack: [C: 032 V: 032] X-Client-IP 6/12 - unset the 4x new headers [puppet] - 10https://gerrit.wikimedia.org/r/244206 (https://phabricator.wikimedia.org/T89177) (owner: 10BBlack) [22:54:17] 6operations, 10Traffic, 6WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#1710997 (10BBlack) I didn't create any that I'm aware, I'd say keep working on the one you made with all the fields in it. 
[22:54:27] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1710998 (10RobH) https://rt.wikimedia.org/Ticket/Display.html?id=9677 now has updated pricing info for a new single cpu misc system. DO NOT PUT PRICIN... [22:56:21] (03CR) 10Dzahn: [C: 04-1] "what qchris said. this has been tried multiple times by multiple people. i had to revert one of them." [puppet] - 10https://gerrit.wikimedia.org/r/242237 (owner: 10Daniel Kinzler) [23:00:04] RoanKattouw ostriches rmoen Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151007T2300). [23:00:04] jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:37] jdlrobson, hey [23:00:55] (03PS4) 10Alex Monk: Enable banners on all namespaces on Russian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243837 (https://phabricator.wikimedia.org/T114566) (owner: 10Jdlrobson) [23:01:17] Krenair: hey hopefully this time no rrr issues :) [23:01:25] (03CR) 10Alex Monk: [C: 032] Enable banners on all namespaces on Russian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243837 (https://phabricator.wikimedia.org/T114566) (owner: 10Jdlrobson) [23:01:32] (03Merged) 10jenkins-bot: Enable banners on all namespaces on Russian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243837 (https://phabricator.wikimedia.org/T114566) (owner: 10Jdlrobson) [23:05:57] jdlrobson, looks good [23:06:11] Krenair: i can't seem to get it to work on talk pages.. 
[23:06:22] https://ru.wikivoyage.org/wiki/%D0%9E%D0%B1%D1%81%D1%83%D0%B6%D0%B4%D0%B5%D0%BD%D0%B8%D0%B5_%D1%83%D1%87%D0%B0%D1%81%D1%82%D0%BD%D0%B8%D0%BA%D0%B0:Jdlrobson [23:06:29] It's not deployed yet, jdlrobson [23:06:37] I just checked that it hasn't taken down the beta cluster. [23:06:44] ah that's what you were referring too :) [23:07:06] jdlrobson, I see lots of OOM on ruwikivoyage [23:07:16] (03PS1) 10QChris: Document gerrit's limitations for regexp matching in Phabricator URLs [puppet] - 10https://gerrit.wikimedia.org/r/244370 [23:07:31] MaxSem: related to page banners? [23:07:35] err [23:07:39] ruwikisource [23:07:45] jdlrobson, if I had deployed it, you'd have seen the sync-file log [23:08:41] (03CR) 10QChris: "> > @QChris Ugh. Is the reliance on $1 documented somewhere?" [puppet] - 10https://gerrit.wikimedia.org/r/242237 (owner: 10Daniel Kinzler) [23:08:54] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/243837/ (duration: 00m 17s) [23:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:09:08] jdlrobson, ^ [23:10:34] Krenair: verified on user talk pages just checking for no regressions [23:11:30] (03PS2) 10BBlack: X-Client-IP 7/12 - Set X-T-P [puppet] - 10https://gerrit.wikimedia.org/r/244207 (https://phabricator.wikimedia.org/T89177) [23:11:38] (03CR) 10BBlack: [C: 032 V: 032] X-Client-IP 7/12 - Set X-T-P [puppet] - 10https://gerrit.wikimedia.org/r/244207 (https://phabricator.wikimedia.org/T89177) (owner: 10BBlack) [23:12:00] Krenair: perfect [23:12:03] looks great. Thanks! [23:12:05] Testing complete. 
[23:15:48] (03PS3) 10Alex Monk: Document translation namespace best practices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244176 (owner: 10Dereckson) [23:15:55] (03CR) 10Alex Monk: [C: 032] Document translation namespace best practices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244176 (owner: 10Dereckson) [23:16:02] (03Merged) 10jenkins-bot: Document translation namespace best practices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244176 (owner: 10Dereckson) [23:18:44] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/244176/ (duration: 00m 17s) [23:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:19:12] (03CR) 10Alex Monk: "This seems like a pain to review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240065 (https://phabricator.wikimedia.org/T104251) (owner: 10Mdann52) [23:19:33] (03PS3) 10Alex Monk: Namespace configuration on bn.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244177 (https://phabricator.wikimedia.org/T114623) (owner: 10Dereckson) [23:22:04] (03CR) 10Alex Monk: [C: 032] Namespace configuration on bn.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244177 (https://phabricator.wikimedia.org/T114623) (owner: 10Dereckson) [23:22:10] (03Merged) 10jenkins-bot: Namespace configuration on bn.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244177 (https://phabricator.wikimedia.org/T114623) (owner: 10Dereckson) [23:22:44] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/244177/ (duration: 00m 17s) [23:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:26:54] (03PS3) 10Ori.livneh: Make the redis cache configuration multi-DC-ready [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244361 (https://phabricator.wikimedia.org/T111575) [23:27:06] (03CR) 10Ori.livneh: [C: 032] Make the 
redis cache configuration multi-DC-ready [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244361 (https://phabricator.wikimedia.org/T111575) (owner: 10Ori.livneh) [23:27:12] (03Merged) 10jenkins-bot: Make the redis cache configuration multi-DC-ready [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244361 (https://phabricator.wikimedia.org/T111575) (owner: 10Ori.livneh) [23:27:58] Krenair: were you done? [23:28:03] not really, but go ahead [23:28:22] Krenair: sorry, I'll be quick [23:29:25] (03PS2) 10Alex Monk: Enable Extension:ShortUrl on mr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242842 (https://phabricator.wikimedia.org/T103646) (owner: 10MarcoAurelio) [23:30:44] (03PS2) 10BBlack: X-Client-IP 8/12 - Set X-CIP [puppet] - 10https://gerrit.wikimedia.org/r/244208 (https://phabricator.wikimedia.org/T89177) [23:31:11] legoktm: what do we do with new ShortUrl deployments? [23:31:14] Lcawte|Away: see ^ [23:31:25] ori: heh I was still looking at that [23:31:38] yuvipanda, you mean how do we do them? [23:31:52] Krenair: should we do them at all [23:31:56] AaronSchulz: already synced to mw1017 and silver, going to do terbium last and then the rest of the cluster [23:31:58] ah. 
[23:32:11] Krenair: because UrlShortener is on the way [23:32:14] yuvipanda: probably continue doing them until UrlShortener is actually ready go to [23:32:47] AaronSchulz: i'll leave it to you to follow up with a patch to introduce the multiwritebagostuff instance [23:32:50] (03CR) 10BBlack: [C: 032 V: 032] X-Client-IP 8/12 - Set X-CIP [puppet] - 10https://gerrit.wikimedia.org/r/244208 (https://phabricator.wikimedia.org/T89177) (owner: 10BBlack) [23:34:27] !log ori@tin Synchronized wmf-config: I924d8e19e17: Make the redis cache configuration multi-DC-ready (T111575) (duration: 00m 17s) [23:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:35:21] Krenair: yours [23:35:26] k [23:36:00] (03CR) 10Alex Monk: [C: 032] Enable Extension:ShortUrl on mr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242842 (https://phabricator.wikimedia.org/T103646) (owner: 10MarcoAurelio) [23:36:06] (03Merged) 10jenkins-bot: Enable Extension:ShortUrl on mr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242842 (https://phabricator.wikimedia.org/T103646) (owner: 10MarcoAurelio) [23:36:14] ori: lgtm [23:36:57] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/242842/ (duration: 00m 17s) [23:36:58] cool [23:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:37:14] what is the best way to include a graph in the incident reports? upload to the relevant phab ticket and reference it? upload to commons? .. ? [23:37:55] subbu: phab ticket + reference [23:38:02] ok, thanks. [23:40:46] 6operations, 10Parsoid: Investigate Oct 3 outage of the Parsoid cluster due to high cpu usage + high memory usage (sharp spike in both) around 08:35 UTC - https://phabricator.wikimedia.org/T114558#1711132 (10ssastry) Since we didn't take a ganglia snapshot on the day of, here is a recreated graph from graphite...