[00:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151120T0000).
[00:00:04] ebernhardson jhobs: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:14] i can take this one
[00:00:17] i'm here
[00:01:08] (CR) EBernhardson: [C: 2] Enable new QuickSurveys survey on mobile enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/254322 (https://phabricator.wikimedia.org/T118881) (owner: Jhobs)
[00:01:20] (PS1) Dzahn: icinga: fix IRC output of paging/critical services [puppet] - https://gerrit.wikimedia.org/r/254331 (https://phabricator.wikimedia.org/T118072)
[00:01:52] (Merged) jenkins-bot: Enable new QuickSurveys survey on mobile enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/254322 (https://phabricator.wikimedia.org/T118881) (owner: Jhobs)
[00:02:08] (PS2) Dzahn: icinga: fix IRC output of paging/critical services [puppet] - https://gerrit.wikimedia.org/r/254331 (https://phabricator.wikimedia.org/T118072)
[00:02:24] operations, Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1819432 (GWicke) @jcrespo: I might need some convincing that all of those are absolutely needed.
[00:02:33] (CR) Dzahn: [C: 2] icinga: fix IRC output of paging/critical services [puppet] - https://gerrit.wikimedia.org/r/254331 (https://phabricator.wikimedia.org/T118072) (owner: Dzahn)
[00:02:41] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: SWAT https://gerrit.wikimedia.org/r/#/c/254322/ (duration: 00m 20s)
[00:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:02:59] jhobs: ^^
[00:04:00] (CR) Dzahn: "what "needs verified"? you just said verified !" [puppet] - https://gerrit.wikimedia.org/r/254331 (https://phabricator.wikimedia.org/T118072) (owner: Dzahn)
[00:04:35] operations, Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1819441 (jcrespo) @GWicke Please answer the question.
[00:05:09] jynus: I think I already gave an answer?
[00:05:12] ebernhardson: looks like everything's working, thanks!
[00:05:37] no, you said that "all not may be needed"
[00:05:50] how can I query recentchanges?
[00:05:52] yeah, which is my opinion
[00:06:03] cannot find it on the documentation
[00:06:13] is it not documented yet?
[00:06:49] jhobs: thanks
[00:06:58] operations, Icinga, Patch-For-Review: icinga-wm not outputing messages for alerts that also paged - https://phabricator.wikimedia.org/T118072#1819443 (Dzahn) Btw,this only affected critical services that were ALSO database services. "grep sms puppet_services.cfg" on neon shows the contact groups for pa...
[00:07:41] I suppose this is the stable version? http://rest.wikimedia.org/en.wikipedia.org/v1/?doc
[00:08:23] jynus: We are exposing some end points that are very expensive, and this makes it difficult to prevent DOS attacks just by setting request rate limits as discussed in the task. I think we should limit the cost of API calls further.
[00:09:01] jynus: this isn't about mysql vs. X at all
[00:09:22] but you will have a plan to cache it eventually?
[00:09:41] to cache what?
[00:09:51] recentchanges and other apis
[00:10:16] with all those variations plus the result changing every second, caching.. doesn't make so much sense
[00:11:11] maybe all of those expensive options *are* needed, but I remain to be convinced on that
[00:11:20] one would even say they have completely different use cases and that comparing them may not even make sense
[00:12:36] I do not like the slim API model
[00:12:45] jynus: historically, the PHP API was motivated by support for editors, so it tends to be focused on low-volume, powerful end points
[00:13:02] it eventually adds overhead on each roundtrip + http, etc.
[00:13:20] the downside of that approach is high costs per call, and along with it a higher risk of DOS
[00:13:36] !log ebernhardson@tin Synchronized php-1.27.0-wmf.7/extensions/CirrusSearch/: SWAT https://gerrit.wikimedia.org/r/254320 and https://gerrit.wikimedia.org/r/254319 (duration: 00m 20s)
[00:13:36] plus, no caching
[00:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:13:46] so yes, different focus areas
[00:13:51] with different trade-offs
[00:14:24] we're engineers, we trade offs!
[00:14:29] so, given that you say that they are completely different use cases
[00:14:45] which means one does not completely replace the other
[00:15:00] (CR) EBernhardson: [C: 2] CirrusSearch: Include all languages we can detect [mediawiki-config] - https://gerrit.wikimedia.org/r/253933 (https://phabricator.wikimedia.org/T118571) (owner: DCausse)
[00:15:07] (CR) EBernhardson: [C: 2] Turn on language detection user test [mediawiki-config] - https://gerrit.wikimedia.org/r/254070 (https://phabricator.wikimedia.org/T118290) (owner: EBernhardson)
[00:15:11] how usage of one is going to improve the security of the other?
[00:15:37] jynus: I don't know why you are trying to make this into a discussion about one api vs. another
[00:15:44] (Merged) jenkins-bot: CirrusSearch: Include all languages we can detect [mediawiki-config] - https://gerrit.wikimedia.org/r/253933 (https://phabricator.wikimedia.org/T118571) (owner: DCausse)
[00:16:05] (Merged) jenkins-bot: Turn on language detection user test [mediawiki-config] - https://gerrit.wikimedia.org/r/254070 (https://phabricator.wikimedia.org/T118290) (owner: EBernhardson)
[00:16:51] you started the conversation by adding me to a ticket
[00:17:02] GWicke added a subscriber: jcrespo
[00:17:30] jynus: I @mentioned you, which does that automatically
[00:17:34] and saying that, if I understood well, that an api will offer protection for a DOS
[00:17:47] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: SWAT https://gerrit.wikimedia.org/r/253933 and https://gerrit.wikimedia.org/r/254070 (duration: 00m 19s)
[00:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:18:20] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-common.php: SWAT https://gerrit.wikimedia.org/r/253933 and https://gerrit.wikimedia.org/r/254070 (duration: 00m 19s)
[00:18:21] jynus: that's not quite what I said; the discussion is about avoiding DOS attacks with request limits, and what the appropriate limits are for different APIs
[00:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:19:14] we had this bug that "if an icinga check is both critical (paging) AND a database related thing"... then and only then it would not show up on IRC
[00:19:17] fixed this
[00:19:37] how was that possible?
[00:19:52] operations, Collaboration-Team-Backlog, Database: Move echo tables from local wiki databases onto extension1 cluster for mediawikiwiki, metawiki, and officewiki - https://phabricator.wikimedia.org/T119154#1819482 (Legoktm) NEW
[00:20:14] there is a contact group called "admins". it has one member, called "irc"
[00:20:25] that irc contact has special notification commands.. that echo into a log file
[00:20:32] the irc bot reads from that log file
[00:20:48] admins is the default group _unless_ it finds another primary contact group
[00:21:00] lol, this will go deep
[00:21:02] there is another special contact called "sms"
[00:21:09] that does the paging
[00:21:14] yes, i can make it shorter:)
[00:21:32] $is_critical = $critical ? {
[00:21:39] true => "${contact_group},sms",
[00:21:50] default => $contact_group,
[00:22:01] this is the check to decide if it should page or not
[00:22:06] we ended up with this:
[00:22:21] contact_groups dba,sms
[00:22:38] that meant "email to dba, paging to all ops" but there is no "admins"
[00:22:42] and admins means IRC
[00:22:47] swat completed
[00:23:07] the fix is to always add ",sms,admins" not just "sms"
[00:23:25] this can lead to stuff like "admins,sms,admins" but icinga doesn't care
[00:23:34] ebernhardson: just in time for pie
[00:23:40] greg-g: yup :)
[00:24:14] (PS5) Thcipriani: RESTBase configuration for scap3 deployment [puppet] - https://gerrit.wikimedia.org/r/252887
[00:25:18] (CR) jenkins-bot: [V: -1] RESTBase configuration for scap3 deployment [puppet] - https://gerrit.wikimedia.org/r/252887 (owner: Thcipriani)
[00:26:03] operations, Collaboration-Team-Backlog, Database: Move echo tables from local wiki databases onto extension1 cluster for mediawikiwiki, metawiki, and officewiki - https://phabricator.wikimedia.org/T119154#1819505 (jcrespo) I was one of the people that wouldn't understand why this was like this in the fi...
[00:26:31] operations, Icinga, Patch-For-Review: icinga-wm not outputing messages for alerts that also paged - https://phabricator.wikimedia.org/T118072#1819510 (Dzahn) after my fix above and running puppet on neon, each time it runs a few contact groups are fixed. we have this now: contact_groups...
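The contact-group bug described above can be sketched in a few lines. This is only an illustration of the logic as explained in the chat, not the actual Puppet code: the function names are hypothetical, and the group names ("admins", "sms", "dba") are taken from the messages above.

```python
def contact_groups_before(contact_group: str, critical: bool) -> str:
    """Old behaviour as described: paging checks only get ',sms' appended."""
    return f"{contact_group},sms" if critical else contact_group


def contact_groups_after(contact_group: str, critical: bool) -> str:
    """Fixed behaviour as described: paging checks get ',sms,admins' appended."""
    return f"{contact_group},sms,admins" if critical else contact_group


def reaches_irc(groups: str) -> bool:
    """'admins' is the group whose only member is the 'irc' contact."""
    return "admins" in groups.split(",")


# A paging database check: the primary group is 'dba', so 'admins' was lost.
old = contact_groups_before("dba", critical=True)
new = contact_groups_after("dba", critical=True)
print(old, reaches_irc(old))
print(new, reaches_irc(new))
```

With the default group the fixed version yields the harmless duplicate mentioned in the chat ("admins,sms,admins").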
[00:26:54] operations, Icinga: icinga-wm not outputing messages for alerts that also paged - https://phabricator.wikimedia.org/T118072#1819511 (Dzahn)
[00:27:28] jynus: more details on the ticket if you want to know ^
[00:27:33] (PS6) Thcipriani: RESTBase configuration for scap3 deployment [puppet] - https://gerrit.wikimedia.org/r/252887
[00:27:38] strange indeed
[00:28:07] i think i also know which change broke it
[00:28:24] it was one to allow us to page non-ops contact groups
[00:44:18] operations, Collaboration-Team-Backlog, Database: Move echo tables from local wiki databases onto extension1 cluster for mediawikiwiki, metawiki, and officewiki - https://phabricator.wikimedia.org/T119154#1819541 (Legoktm) >>! In T119154#1819505, @jcrespo wrote: > I was one of the people that wouldn't u...
[00:45:18] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [5000000.0]
[00:45:24] operations, Icinga: icinga-wm not outputing messages for alerts that also paged - https://phabricator.wikimedia.org/T118072#1819544 (Dzahn) Open>Resolved
[00:48:55] operations, Collaboration-Team-Backlog, Database: Move echo tables from local wiki databases onto extension1 cluster for mediawikiwiki, metawiki, and officewiki - https://phabricator.wikimedia.org/T119154#1819554 (jcrespo) a:jcrespo My mistake, I assumed labswiki was part of the migration just becaus...
[00:54:37] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 87.50% of data above the critical threshold [5000000.0]
[00:56:36] operations, Wikimedia-DNS, domains: Transfer of domain names to WMF servers - https://phabricator.wikimedia.org/T114922#1819615 (Dzahn) @VBaranetsky Hi, did you get any update meanwhile?
[00:57:42] operations, Wikimedia-DNS, domains: Transfer of domain names to WMF servers - https://phabricator.wikimedia.org/T114922#1819616 (Dzahn) a:Dzahn>VBaranetsky
[00:58:23] operations, Wikimedia-DNS, domains: Transfer of domain names to WMF servers - https://phabricator.wikimedia.org/T114922#1709675 (Dzahn) please feel free to assign it back to me if any updates. thank you
[01:00:26] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [5000000.0]
[01:02:18] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[01:02:50] operations: Remove exim aliases for cdeubner - https://phabricator.wikimedia.org/T118900#1819624 (Dzahn) a:Dzahn
[01:04:19] ebernhardson: hey, I know the window just closed, but can we do an emergency deployment of a tiny config change? Our sample rate for a survey is too high.
[01:05:16] (PS1) Jhobs: Reduce sampling rate for QuickSurveys survey [mediawiki-config] - https://gerrit.wikimedia.org/r/254352
[01:07:47] operations: Remove exim aliases for cdeubner - https://phabricator.wikimedia.org/T118900#1819640 (Dzahn) The only line related to Chip we had on the operations side in exim was this: -chip: cdeubner I removed it. But i feel like there might be more that is actually on the OIT side in Google.
[01:08:24] operations: Remove exim aliases for cdeubner - https://phabricator.wikimedia.org/T118900#1819642 (Dzahn) Open>Resolved ah ,nevermind you actually said this specific alias. so i declare it done
[01:08:59] jhobs: do you have the patch ready?
[01:09:04] yes
[01:09:09] https://gerrit.wikimedia.org/r/254352
[01:10:00] (CR) BryanDavis: [C: 2] Reduce sampling rate for QuickSurveys survey [mediawiki-config] - https://gerrit.wikimedia.org/r/254352 (owner: Jhobs)
[01:10:21] (Merged) jenkins-bot: Reduce sampling rate for QuickSurveys survey [mediawiki-config] - https://gerrit.wikimedia.org/r/254352 (owner: Jhobs)
[01:11:26] (PS2) Dzahn: grafana: disable gravatar integration [puppet] - https://gerrit.wikimedia.org/r/254324 (owner: JanZerebecki)
[01:11:36] !log bd808@tin Synchronized wmf-config/InitialiseSettings.php: Reduce sampling rate for QuickSurveys survey (28c69d2) (duration: 00m 20s)
[01:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:11:46] jhobs: ^
[01:11:51] thanks bd808!
[01:12:22] (CR) Dzahn: [C: 2] "yea, we also disabled this in gitblit for the same reason, so also doing it here" [puppet] - https://gerrit.wikimedia.org/r/254324 (owner: JanZerebecki)
[01:13:44] oops, forgot to re-enable puppet on krypton
[01:14:09] and has to fix racktables because Apache 2.4 vs. 2.2
[01:15:13] twentyafterfour, thcipriani|afk: we should get the latest scap deployed to fix the problems with syncing to mira
[01:15:33] * bd808 forgot to put the sarcastic quotes around that "we"
[01:18:13] (PS1) Dzahn: racktables: adjust Apache config to 2.4 [puppet] - https://gerrit.wikimedia.org/r/254353 (https://phabricator.wikimedia.org/T105555)
[01:19:18] (PS2) Dzahn: racktables: adjust Apache config to 2.4 [puppet] - https://gerrit.wikimedia.org/r/254353 (https://phabricator.wikimedia.org/T105555)
[01:19:46] (CR) Dzahn: [C: 2 V: 2] racktables: adjust Apache config to 2.4 [puppet] - https://gerrit.wikimedia.org/r/254353 (https://phabricator.wikimedia.org/T105555) (owner: Dzahn)
[01:23:49] (PS1) Ori.livneh: webperf: log per-user-agent Navigation Timing data [puppet] - https://gerrit.wikimedia.org/r/254355 (https://phabricator.wikimedia.org/T112594)
[01:29:04] !log catrope@tin Synchronized php-1.27.0-wmf.7/extensions/Echo/includes/formatters/BasicFormatter.php: unstub wgLang to fix type hint errors (duration: 00m 20s)
[01:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:29:18] (PS2) Ori.livneh: webperf: log per-user-agent Navigation Timing data [puppet] - https://gerrit.wikimedia.org/r/254355 (https://phabricator.wikimedia.org/T112594)
[01:30:10] !log catrope@tin Synchronized php-1.27.0-wmf.7/resources/lib/jquery.i18n/src/jquery.i18n.language.js: Unbreak IE8 support in jquery.i18n (duration: 00m 25s)
[01:30:13] (PS3) Ori.livneh: webperf: log per-user-agent Navigation Timing data [puppet] - https://gerrit.wikimedia.org/r/254355 (https://phabricator.wikimedia.org/T112594)
[01:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:31:19] !log catrope@tin Synchronized php-1.27.0-wmf.7/includes/widget/TitleInputWidget.php: Fix conflicting configuration name in TitleInputWidget (duration: 00m 20s)
[01:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:31:39] !log catrope@tin Synchronized php-1.27.0-wmf.7/resources/src/mediawiki.widgets/mw.widgets.TitleWidget.js: Fix conflicting configuration name in TitleInputWidget (duration: 00m 19s)
[01:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:31:57] PROBLEM - Host mw1041 is DOWN: PING CRITICAL - Packet loss = 100%
[01:32:32] (PS4) Ori.livneh: webperf: log per-user-agent Navigation Timing data [puppet] - https://gerrit.wikimedia.org/r/254355 (https://phabricator.wikimedia.org/T112594)
[01:32:59] (CR) Ori.livneh: [C: 2 V: 2] "Existing metrics will not be disturbed. New metrics can be dropped if there is an issue with this code." [puppet] - https://gerrit.wikimedia.org/r/254355 (https://phabricator.wikimedia.org/T112594) (owner: Ori.livneh)
[01:35:15] operations, Analytics-Backlog, HTTPS: EventLogging sees too few distinct client IPs - https://phabricator.wikimedia.org/T119144#1819682 (Tbayer) Below is a closer look on how these numbers developed over time, for two of the three tables examined in the task. Seems something happened in June, around the...
[01:35:31] (PS2) Alex Monk: Use a more useful error message when DB connection fails [software/dbtree] - https://gerrit.wikimedia.org/r/251791
[01:45:36] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0]
[01:50:35] (PS1) Ori.livneh: webperf/navtiming: Fix version string normalization [puppet] - https://gerrit.wikimedia.org/r/254358
[01:50:56] (CR) Ori.livneh: [C: 2 V: 2] webperf/navtiming: Fix version string normalization [puppet] - https://gerrit.wikimedia.org/r/254358 (owner: Ori.livneh)
[02:02:16] PROBLEM - Host rigel is DOWN: PING CRITICAL - Packet loss = 100%
[02:03:36] frack, codfw
[02:04:50] Rigel is codfw fr?
[02:05:07] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: puppet fail
[02:05:17] RECOVERY - Host rigel is UP: PING OK - Packet loss = 0%, RTA = 37.51 ms
[02:09:06] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[02:09:47] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[02:10:08] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: puppet fail
[02:15:07] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: puppet fail
[02:20:08] RECOVERY - check_puppetrun on payments2003 is OK: OK: Puppet is currently enabled, last run 186 seconds ago with 0 failures
[02:30:07] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 603
[02:33:17] PROBLEM - Hadoop NodeManager on analytics1052 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[02:34:09] !log l10nupdate@tin Synchronized php-1.27.0-wmf.7/cache/l10n: l10nupdate for 1.27.0-wmf.7 (duration: 10m 47s)
[02:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:35:07] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 903
[02:40:07] RECOVERY - check_mysql on lutetium is OK: Uptime: 1938733 Threads: 2 Questions: 59384152 Slow queries: 18707 Opens: 76655 Flush tables: 2 Open tables: 64 Queries per second avg: 30.630 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[02:40:17] PROBLEM - puppet last run on mw1217 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:49:57] RECOVERY - Hadoop NodeManager on analytics1052 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[02:51:26] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0]
[02:53:42] thanks bd808 for that save for the sampling rate. I was told about it at our mixer and gave leila the thumbs up.
[02:54:24] jhobs: thanks for the quick response as well.
[03:06:36] RECOVERY - puppet last run on mw1217 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:08:07] greg-g: np. changes like that are a no brainer :)
[03:08:43] I need to remember to ask if there are other reading folks who want to get comfortable deploying stuff
[03:10:57] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[03:11:38] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[03:22:05] bd808: yes, please!
[03:22:26] yet another reason to get MW migrated to scap3
[03:50:02] Where is the best documentation on scap3 (if there is any)?
[03:51:06] Negative24: https://doc.wikimedia.org/mw-tools-scap/scap3/index.html
[03:51:57] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 1 below the confidence bounds
[03:54:25] twentyafterfour: ah and that is a doc site I have never seen before
[04:10:46] PROBLEM - puppet last run on lvs2005 is CRITICAL: CRITICAL: puppet fail
[04:14:36] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 9 below the confidence bounds
[04:36:46] !log deploying scap on tin
[04:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:38:48] RECOVERY - puppet last run on lvs2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:38:50] (CR) 20after4: [C: 2] Revert "checkoutMediaWiki: sudo as mwdeploy for most things" [mediawiki-config] - https://gerrit.wikimedia.org/r/253684 (owner: BryanDavis)
[04:39:33] (Merged) jenkins-bot: Revert "checkoutMediaWiki: sudo as mwdeploy for most things" [mediawiki-config] - https://gerrit.wikimedia.org/r/253684 (owner: BryanDavis)
[04:41:14] !log twentyafterfour@tin Synchronized multiversion/checkoutMediaWiki.php: deploying I2ec91aa33e05c2dc367db8a6f6ba56be5397906b to test scap mirroring (duration: 00m 20s)
[04:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:43:00] mw1041 is down intentionally?
[04:44:26] the last thing about it in sal was it being restarted on 2015-11-12
[04:45:02] it's completely down, no route to host
[04:45:20] also, the scap master-replication failed
[04:45:27] blerg!
[04:45:36] what's broke about it this time?
[04:45:54] CalledProcessError: Command '['sudo', '-u', 'mwdeploy', '-g', 'wikidev', '-n', '--', '/usr/bin/rsync', '--archive', '--delete-delay', '--delay-updates', '--compress', '--delete', '--exclude=**/cache/l10n/*.cdb', '--exclude=*.swp', '--no-perms', 'tin.eqiad.wmnet::common', '/srv/mediawiki-staging']' returned non-zero exit status 1
[04:46:07] I thought it was not supposed to be calling rsync directly now
[04:46:45] it's not but I don't think scap has been updated yet on tin
[04:47:00] I just updated it, or I thought I did
[04:47:37] the patch is there...
[04:47:57] but I guess mira didn't get updated?
[04:48:23] indeed it didn't
[04:49:09] trebuchet didn't whine?
[04:50:01] now it's updated?
[04:50:57] trebuchet said 456/483 minions completed fetch; 455/483 minions completed checkout
[04:51:07] I can't figure out how to retry the ones that didn't complete
[04:51:13] I hate trebuchet
[04:51:34] you "retry" by waiting longer
[04:51:47] * greg-g chuckles
[04:52:01] the detailed report will tell you what actually is going on (which minions have errors or just haven't responded)
[04:52:12] but I can see it is updated now on mira
[04:52:19] bd808: it didn't give me a chance, it just said deployment complete...
[04:52:27] bd808: weird
[04:52:31] ok I'll try again
[04:52:32] the joy of salt's eventually consistent model
[04:53:45] bd808: "eventually" "consistent"
[04:53:59] 04:53:08 Started sync-masters
[04:54:01] sync-masters: 0% (ok: 0; fail: 0; left: 1)
[04:54:17] so far it's just hanging, I don't know if it's doing something but I suspect it might be ;)
[04:54:37] it will look like it is hanging because there is just the one host
[04:54:58] I see a running rsync process
[04:55:04] (on mira)
[04:55:06] yeah I see two actually
[04:55:23] !log twentyafterfour@tin Synchronized multiversion/checkoutMediaWiki.php: inconsequential sync to test mirroring (duration: 02m 14s)
[04:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:56:33] bd808: ok I got some permission denied messages from mira :-/
[04:56:37] how does root get denied?
[04:56:41] so to see the status of the last trebuchet deploy with details you can run `git deploy report --detailed sync`
[04:57:03] were they "can't delete non empty directory" warnings?
[04:57:28] no actually the errors were after sync-masters finished, so I guess that part worked! :)
[04:57:37] w00t
[04:57:42] progress
[04:57:47] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/).
[04:58:33] https://phabricator.wikimedia.org/P2335
[04:59:00] I see the last config change I synced today on mira via git log
[04:59:03] which is nice
[04:59:34] yeah looks like it sync'd properly! :)
[04:59:42] k so the cdb rebuild is still having issues
[04:59:43] now as to why the rest of the sync failed....
[05:00:08] these permissions issues are getting ridiculous
[05:00:26] "drwxr-xr-x 3 trebuchet l10nupdate 4096 Nov 17 20:43 /srv/mediawiki-staging/php-1.27.0-wmf.7/cache/l10n/"
[05:00:28] wtf?
[05:00:58] oh for fuck sake
[05:01:11] the uids don't match across the hosts
[05:01:42] on tin: uid=997(l10nupdate) gid=10002(l10nupdate) groups=10002(l10nupdate)
[05:01:53] on mira: uid=12162(l10nupdate) gid=10002(l10nupdate) groups=10002(l10nupdate)
[05:02:47] so the rsync worked perfectly but the hosts have lame configuration because we let puppet randomly pick uids
[05:03:07] we had tons of these in beta cluster when we first set it up
[05:04:31] it shouldn't be too hard for a root to fix puppet and renumber the uids on mira
[05:05:04] bd808: ok want me to make a ticket?
[05:05:14] yes please
[05:13:53] operations, Deployment-Systems: uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1819881 (mmodell) NEW
[05:25:48] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[06:04:34] operations, Analytics-Backlog, HTTPS: EventLogging sees too few distinct client IPs - https://phabricator.wikimedia.org/T119144#1820035 (ori) Thanks for the detailed investigation, @Tbayer. Querying the WikimediaBlogVisit data was a very clever idea -- I wish I had thought of it :) I think we should ju...
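The uid mismatch between tin and mira reported above (T119165) is the kind of drift that can be spotted by comparing `id` output across hosts. A minimal sketch, assuming the `id` lines have already been collected per host by some other means; the helper names are hypothetical, and the sample strings are the two lines pasted in the chat.

```python
import re

# Matches the leading part of `id` output, e.g. "uid=997(l10nupdate) gid=10002(...)".
ID_RE = re.compile(r"uid=(\d+)\((\w+)\) gid=(\d+)")


def parse_id(output: str) -> tuple[str, int, int]:
    """Return (user, uid, gid) parsed from one line of `id` output."""
    match = ID_RE.search(output)
    if match is None:
        raise ValueError(f"unrecognised id output: {output!r}")
    return match.group(2), int(match.group(1)), int(match.group(3))


def find_uid_mismatches(per_host: dict[str, str]) -> dict[str, dict[str, int]]:
    """Map each user whose uid differs between hosts to its {host: uid} table."""
    uids: dict[str, dict[str, int]] = {}
    for host, output in per_host.items():
        user, uid, _gid = parse_id(output)
        uids.setdefault(user, {})[host] = uid
    return {
        user: hosts
        for user, hosts in uids.items()
        if len(set(hosts.values())) > 1
    }


outputs = {
    "tin": "uid=997(l10nupdate) gid=10002(l10nupdate) groups=10002(l10nupdate)",
    "mira": "uid=12162(l10nupdate) gid=10002(l10nupdate) groups=10002(l10nupdate)",
}
print(find_uid_mismatches(outputs))
```

Only uids are compared here because the gids in the pasted output already agree (10002 on both hosts).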
[06:30:16] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:47] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:17] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:17] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:46] PROBLEM - puppet last run on db1045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:27] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:36] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:37] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:47] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:07] PROBLEM - puppet last run on sca1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:08] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:56] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:08] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:49:23] bd808: seriously?
[06:49:31] we had tons of these in beta cluster when we first set it up
[06:49:34] it shouldn't be too hard for a root to fix puppet and renumber the uids on mira
[06:50:04] 2 does not follow 1
[06:50:29] if it happened *again*, that's an even stronger sign that it needs to be fixed properly rather than hacked around by a root
[06:51:32] <_joe_> ori: no I think what actually happened is that tin is around since forever and uids were created with different settings than the current ones
[06:51:52] <_joe_> but I will take a better look, I have to reimage tin soon
[06:55:56] RECOVERY - puppet last run on db1045 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:56:37] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:56:57] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:57] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:57:16] RECOVERY - puppet last run on sca1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:17] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:26] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:27] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:06] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:58:16] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:26] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:36] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:46] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:32:57] <_joe_> and icinga is a christmas tree again
[07:36:07] <_joe_> !log powercycled mw1041
[07:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:37:06] RECOVERY - Host mw1041 is UP: PING WARNING - Packet loss = 50%, RTA = 67.31 ms
[07:42:20] (PS3) Muehlenhoff: Exclude apport from toollabs genpp python list [puppet] - https://gerrit.wikimedia.org/r/254156
[07:43:47] RECOVERY - salt-minion processes on pybal-test2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:43:48] RECOVERY - Disk space on pybal-test2001 is OK: DISK OK
[07:43:48] RECOVERY - DPKG on pybal-test2001 is OK: All packages OK
[07:44:01] <_joe_> !log rebooted pybal-test2001
[07:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:44:07] RECOVERY - RAID on pybal-test2001 is OK: OK: no RAID installed
[07:44:38] RECOVERY - configured eth on pybal-test2001 is OK: OK - interfaces up
[07:45:08] RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[07:45:26] RECOVERY - dhclient process on pybal-test2001 is OK: PROCS OK: 0 processes with command name dhclient
[07:45:47] RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:54:02] (CR) Yuvipanda: [C: -1] "I think you need to *run* python.py and commit the changes too." [puppet] - https://gerrit.wikimedia.org/r/254156 (owner: Muehlenhoff)
[07:56:41] operations, Analytics-Backlog, HTTPS: EventLogging sees too few distinct client IPs - https://phabricator.wikimedia.org/T119144#1820111 (Tbayer) >>! In T119144#1820035, @ori wrote: > I think we should just drop this field and the associated code. I cannot recall a single case of it being used for its i...
[08:02:46] RECOVERY - NTP on pybal-test2001 is OK: NTP OK: Offset -0.003401756287 secs
[08:04:39] (PS4) Merlijn van Deen: Exclude apport from toollabs genpp python list [puppet] - https://gerrit.wikimedia.org/r/254156 (owner: Muehlenhoff)
[08:05:02] (CR) Muehlenhoff: "Of course, but I was rather planning a followup commit (first fix the "source", then "build" it). But I can also integrate them into this " [puppet] - https://gerrit.wikimedia.org/r/254156 (owner: Muehlenhoff)
[08:05:20] (CR) Merlijn van Deen: [C: -1] "u-a should use the --minimal-upgrade-steps flag so it can be interrupted safely with SIGINT" [puppet] - https://gerrit.wikimedia.org/r/254295 (owner: Merlijn van Deen)
[08:32:26] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:38:55] operations, RESTBase, Revscoring, Services, and 2 others: Set up revscoring entry points in RESTBase - https://phabricator.wikimedia.org/T107196#1489425 (Joe)
[08:42:50] (PS5) Muehlenhoff: Exclude apport from toollabs genpp python list [puppet] - https://gerrit.wikimedia.org/r/254156
[08:43:56] (CR) Yuvipanda: [C: 1] "I don't really think this requires public announcement though - is it really possible for user's tools to directly be calling into it?" [puppet] - https://gerrit.wikimedia.org/r/254156 (owner: Muehlenhoff)
[08:45:53] (CR) Muehlenhoff: "I don't think so, all meaning invocations of python-apport would require a local apport installation anyway." [puppet] - https://gerrit.wikimedia.org/r/254156 (owner: Muehlenhoff)
[08:59:07] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:02:12] operations, RESTBase-Cassandra, Patch-For-Review: establish new thresholds for cassandra alarms after switching restbase to dtcs - https://phabricator.wikimedia.org/T118976#1820146 (fgiunchedi) leaving this open, pending confirmation that new threshold are ok
[09:24:27] (Abandoned) Yuvipanda: mediawiki: Ensure that /etc/php5/apache dir exists [puppet] - https://gerrit.wikimedia.org/r/196773 (https://phabricator.wikimedia.org/T88442) (owner: Yuvipanda)
[09:37:35] (CR) Giuseppe Lavagetto: [C: 1] monitoring: fail on graphite metrics using single quotes [puppet] - https://gerrit.wikimedia.org/r/252963 (https://phabricator.wikimedia.org/T118398) (owner: Filippo Giunchedi)
[09:39:06] operations, RESTBase-Cassandra: use correct datacenter/rack for cassandra nodes - https://phabricator.wikimedia.org/T89657#1820192 (fgiunchedi)
[09:40:48] (CR) Giuseppe Lavagetto: [C: 1] "Looks good, I can't help but wonder why are we running a fork of trebuchet..." [puppet] - https://gerrit.wikimedia.org/r/254128 (https://phabricator.wikimedia.org/T118380) (owner: Filippo Giunchedi)
[09:40:51] operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: rename cassandra cluster - https://phabricator.wikimedia.org/T112257#1820193 (fgiunchedi) Open>stalled stalling as figuring out a procedure to rename safely is still needed but we've agreed to leave eqiad alone for now
[09:42:29] operations, RESTBase-Cassandra, Patch-For-Review: establish new thresholds for cassandra alarms after switching restbase to dtcs - https://phabricator.wikimedia.org/T118976#1820196 (fgiunchedi) Open>stalled
[09:43:41] (PS4) Filippo Giunchedi: monitoring: fail on graphite metrics using single quotes [puppet] - https://gerrit.wikimedia.org/r/252963 (https://phabricator.wikimedia.org/T118398)
[09:43:48] (CR) Filippo Giunchedi: [C: 2 V: 2] monitoring: fail on graphite metrics using single quotes [puppet] - https://gerrit.wikimedia.org/r/252963 (https://phabricator.wikimedia.org/T118398) (owner: Filippo Giunchedi)
[09:44:46] operations, Patch-For-Review: Add monitoring of upload rate on commons to icinga alerts - https://phabricator.wikimedia.org/T92322#1820201 (fgiunchedi)
[09:44:47] operations, Graphite, Patch-For-Review: icinga strips single quotes from metric names for check_graphite - https://phabricator.wikimedia.org/T118398#1820199 (fgiunchedi) Open>Resolved not really fixed per se, but puppet will barf now
[10:06:56] !log Running SecurePoll's arbcomlist.php on terbium with performance fix from https://gerrit.wikimedia.org/r/#/c/254375/ , per request from Jamesofur. Looks to be behaving well, can be killed if it acts up.
[10:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:13:58] (03PS1) 10Addshore: WDQS Also use queryStartCount counter [puppet] - 10https://gerrit.wikimedia.org/r/254378 [10:19:12] (03PS2) 10Addshore: WDQS Also use queryStartCount counter [puppet] - 10https://gerrit.wikimedia.org/r/254378 (https://phabricator.wikimedia.org/T119178) [10:37:06] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000000.0] [10:39:27] PROBLEM - Hadoop NodeManager on analytics1043 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [10:39:47] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [10:40:14] o/ Anyone around? [10:40:36] Odd issue with Commons… and it only seems to affect me. [10:41:18] I can edit, but all day today when I try to look at my watchlist I get a 503 error. [10:42:24] I’m thinking it might have something to do with what server I’m hitting… to me, commons.wikimedia.org is 208.80.153.224. [10:43:16] RECOVERY - Hadoop NodeManager on analytics1043 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [10:43:43] text-lb.codfw.wikimedia.org [10:44:14] andrewbogott: Ping? ^^ [10:46:37] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [5000000.0] [10:47:41] <_joe_> Revent: how long is your watchlist? [10:48:07] <_joe_> I can try to search for specific errors [10:48:54] It’s massive, actually. [10:49:14] <_joe_> has it changed a lot lately? [10:49:17] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [5000000.0] [10:49:36] Ridiculously so, due to not remembering to turn off ‘auto watchinglist’ before using VFC. 
[10:49:48] I’ve been removing pages from it quite a bit, yes. [10:50:11] <_joe_> Revent: so you are removing pages, not adding new ones [10:50:17] Yes. [10:50:27] I might have added a few, but not many. [10:50:47] <_joe_> because I know very very long watchlists can have issues, I've seen other users with very long watchlists having problems in the past [10:52:01] I should probably manually edit it… have actually been intending to, but even trying to load it in the editor has been unusably slow, and now I can’t even look at it. [10:52:18] <_joe_> Revent: let me take a look [10:52:26] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0] [10:52:44] I did change ‘Maximum number of changes to show in expanded watchlist’ from 250 to 1000 recently. [10:52:47] <_joe_> Revent: your watchlist on commons? [10:52:50] Yes. [10:53:11] <_joe_> ah ok, so that means I cannot reproduce your request properly if not logged in as you :/ [10:53:16] Fair warning, its size is just stupid. [10:53:27] <_joe_> a thing I cannot do, so let's go back to looking at logs :) [10:54:00] I’ll see if switching it back to 250 helps... [10:54:18] <_joe_> it probably would, I still want to see what's the issue here [10:54:27] <_joe_> but yeah that would probably fix your contingent problem [10:54:35] <_joe_> I hope so, at least [10:55:04] Doesn’t look like it, actually, haven’t gotten the 503 yet, but it’s just spinning trying to load it. [10:56:17] Yeah, and 503. [10:56:48] No error message, just “Service temporarily unavailable” [10:56:57] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0] [10:57:02] <_joe_> a wmf error page, right? [10:57:08] Indeed. [10:57:30] “Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon.” and so forth. 
[10:57:51] <_joe_> Ok that means that the caching server (to which you talk directly, more or less) is not getting a valid response from any backend [10:58:02] <_joe_> I am trying to search the logs for related errors now [10:58:53] There was something similar the other day… not directly, but codfw wasn’t updating an image thumbnail.. everyone else but me saw it switch, until bawolff fiddled it somehow. [10:59:14] (why I mentioned the IP I hit) [11:00:22] <_joe_> Revent: codfw is just a caching pop atm, I don't know anything about the thumbnail issue the other day, but I'm pretty sure it has nothing to do with your current problem [11:00:48] Fair enough. [11:01:34] <_joe_> Revent: sorry it will take me some time to figure out what's happening, I might open a bug later. Are you on phabricator by any chance? If so, I'll subscribe you to the ticket [11:01:48] Yes, I am. [11:01:54] Same name. [11:02:04] Would appreciate it... [11:14:21] (03PS1) 10Muehlenhoff: Further finetuning to debdeploy server groups [puppet] - 10https://gerrit.wikimedia.org/r/254381 [11:15:29] (03CR) 10Muehlenhoff: [C: 032 V: 032] Further finetuning to debdeploy server groups [puppet] - 10https://gerrit.wikimedia.org/r/254381 (owner: 10Muehlenhoff) [11:18:30] <_joe_> Revent: the only related existing bug I found is https://phabricator.wikimedia.org/T41510 [11:18:42] <_joe_> and no useful message in the logs [11:19:26] (heh) Yeah, my watchlist is unfortunately far more insane than that, I should probably just nuke it. [11:19:35] <_joe_> Revent: oh, I see [11:19:37] <_joe_> yeah :) [11:20:15] <_joe_> we should still open a bug maybe, but I think that's far beyond the expected capabilities of watchlists :P [11:20:34] <_joe_> Revent: ballpark estimate: how long is that watchlist? [11:20:34] Like I said, due to VFC… I was doing a ‘mass correction’ of what license Fae had used on a huge pile of bot uploads, to ‘combine’ multiple licenses, and it exploded. [11:20:39] 80k or so. 
[11:20:45] <_joe_> ok [11:20:52] It’s ridiculous. [11:21:01] <_joe_> I'll file a bug, but I don't have a lot of info :/ [11:21:04] Admittedly, so long as to be largely unusable. [11:21:35] I’ve been removing hundreds a day as they get edited, would have thinned it down manually but the editor would not load. [11:21:38] <_joe_> (not your fault I don't have a lot of info, too) [11:22:07] <_joe_> Revent: heh, if you need a working watchlist _right_now_, nuke it [11:22:13] I would not be stunned if it’s the largest watchlist on the wiki, by far. [11:22:58] Yeah, will do so… mainly was waiting to see if still having it ‘broken’ would be useful in checking if it was fixed. [11:23:41] <_joe_> yeah that's what i was about to ask [11:24:07] Aaand… spinny spinny when trying to clear it… waiting. [11:24:12] <_joe_> if you can live without it for a few hours, i/someone more involved with the code can take a look [11:24:32] Looks like it might fail on deleting it. [11:24:51] <_joe_> that is a big delete for the db, i guess [11:26:14] <_joe_> ok my best guess at the moment is: the backend does respond to the caching layer, but in a time that is way longer than the timeout of requests [11:26:31] <_joe_> that is why we see a 503 but I cannot find any error in the logs [11:27:11] Ok.... [11:27:17] A database query error has occurred. This may indicate a bug in the software. [11:27:18] Function: SpecialEditWatchlist::clearWatchlist [11:27:19] Error: 2013 Lost connection to MySQL server during query (10.64.16.29) [11:27:27] <_joe_> heh [11:28:11] So yeah, you’re right. [11:28:30] Need a dev, I guess, to do it directly. [11:29:27] (I can, if needed, make an edit to verify the request) [11:29:31] <_joe_> yes, I'd prefer not to mess with the db directly [11:29:35] I’m cloaked, tho. [11:30:52] Ok, on a second try… “Cleared watchlist”… “There are too many pages to display here. [11:30:53] Return to Special:Watchlist.” [11:31:03] It did nuke it, tho. 
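A minimal sketch of the guess _joe_ makes at 11:26: the backend does answer, but only after the caching layer has already given up, so the client sees a 503 while the backend itself logs no error. This is not Varnish or MediaWiki code; `backend_request`, `proxy`, and all timings are hypothetical names and values picked only to illustrate the idea.

```python
import concurrent.futures
import time


def backend_request(duration_s):
    """Hypothetical backend: works for duration_s seconds, then succeeds."""
    time.sleep(duration_s)
    return 200


def proxy(duration_s, timeout_s):
    """Hypothetical caching layer: waits at most timeout_s for the backend.

    Returns the status the client would see, plus the future for the
    backend call so the caller can inspect what the backend did afterwards.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(backend_request, duration_s)
    try:
        status = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The proxy stops waiting; the backend keeps running regardless.
        status = 503
    pool.shutdown(wait=False)
    return status, future


if __name__ == "__main__":
    client_status, backend_future = proxy(duration_s=0.2, timeout_s=0.05)
    print("client saw:", client_status)
    print("backend eventually returned:", backend_future.result())
```

In this toy model the slow request finishes on the backend even though the client was already told 503, which would be consistent with an error page appearing without any matching backend error.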
[11:31:26] (0 pages watched) [11:31:36] <_joe_> ok at least that worked [11:31:55] Yeah, guess the query still ran even though it couldn’t give a response. [11:32:19] Thanks for the help, I’ll remember not to watchlist 80,000 pages again. :P [11:32:19] <_joe_> I still have a couple of bugs to look at, but yeah I don't think anyone tried to optimize the extension for 80k long lists :P [11:32:37] <_joe_> yw [11:33:14] I freely admit is was insane, I was spending far too much time removing things one at a time. [11:33:17] *it [11:34:03] 6operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1820349 (10faidon) Is there any evidence (or even credible suspicion) that legitimate clients in the wild are hitting these limits? [11:48:57] someone restarting esams' LVSes? [11:49:07] or restarted them 10-15 minutes ago? [11:49:31] <_joe_> paravoid: not that I know of [11:49:55] <_joe_> as in restarting the boxes? [11:50:14] restarting pybal [11:50:30] <_joe_> paravoid: it might have died? which box? [11:51:42] me neither [11:52:52] Nov 20 11:39:42 lvs3001 pybal[23731]: File "/usr/lib/python2.7/dist-packages/pybal/bgp.py", line 1350, in _closeConnection [11:52:55] Nov 20 11:39:42 lvs3001 pybal[23731]: if self.bgpPeering: self.bgpPeering.connectionClosed(self.protocol) [11:52:58] Nov 20 11:39:42 lvs3001 pybal[23731]: File "/usr/lib/python2.7/dist-packages/pybal/bgp.py", line 1997, in connectionClosed [11:53:01] Nov 20 11:39:42 lvs3001 pybal[23731]: assert self.fsm.state == ST_IDLE [11:53:04] Nov 20 11:39:42 lvs3001 pybal[23731]: exceptions.AssertionError: [11:53:30] hmm [11:53:33] something really bad happened [11:53:41] right above this [11:53:41] Nov 20 11:39:41 lvs3001 systemd[1]: systemd-logind.service watchdog timeout (limit 1min)! [11:53:45] Nov 20 11:39:41 lvs3001 systemd[1]: Unit systemd-logind.service entered failed state. 
[11:53:48] Nov 20 11:39:41 lvs3001 systemd[1]: systemd-logind.service has no holdoff time, scheduling restart. [11:53:51] Nov 20 11:39:41 lvs3001 systemd[1]: Stopping Login Service... [11:53:53] Nov 20 11:39:41 lvs3001 systemd[1]: Starting Login Service... [11:53:57] and then kernel stacks [11:54:07] <_joe_> so just that machine? [11:54:16] no [11:54:17] lvs3003 too [11:54:35] <_joe_> uhm the active and backup pair [11:54:45] yes mr. obvious :P [11:54:59] starts at [11:54:59] Nov 20 11:38:31 lvs3001 kernel: [4927521.503625] warn_alloc_failed: 6836772 callbacks suppressed [11:55:09] http://ganglia.wikimedia.org/latest/?c=LVS%20loadbalancers%20esams&h=vl100-eth0.lvs3001.esams.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [11:55:12] fun [11:55:45] <_joe_> whoa [11:57:40] neat [12:23:16] PROBLEM - puppet last run on mw2182 is CRITICAL: CRITICAL: Puppet has 1 failures [13:13:16] (03PS2) 10DCausse: Rename timestamp to ts for CirrusSearchRequestSet [puppet] - 10https://gerrit.wikimedia.org/r/252432 (https://phabricator.wikimedia.org/T117873) [13:14:02] (03CR) 10DCausse: [C: 04-1] "We should deploy Ie575f471 first." [puppet] - 10https://gerrit.wikimedia.org/r/252432 (https://phabricator.wikimedia.org/T117873) (owner: 10DCausse) [13:19:07] RECOVERY - puppet last run on mw2182 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:33:53] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1820452 (10zhuyifei1999) p:5Triage>3High I can confirm this issue again in: # [[https://commons.wikimedia.org/wiki/File:Faro_card_game.... 
[13:46:45] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, 7Varnish: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1820459 (10zhuyifei1999) Issue likely to be related to varnish cache (with [[https://wikitech.wikimedia.org/wiki/Debugging_in_pr... [13:53:05] 6operations, 10ops-eqiad, 10netops: cr1-eqiad PEM 0 fan failed - https://phabricator.wikimedia.org/T118721#1820462 (10Cmjohnson) 5Open>3Resolved Tracking Number RMA R405131-1 – 1ZA883E59093552038 [13:55:21] Hi! Where can I find the MediaWiki system messages using on the Watchlist page as "Show last:" and "Associated namespace" for example [13:55:36] Hi Raymond_ [13:55:39] Hi! Where can I find the MediaWiki system messages using on the Watchlist page as "Show last:" and "Associated namespace" for example [13:55:44] doctaxon: https://www.mediawiki.org/wiki/Qqx [13:55:48] https://translatewiki.net/wiki/Special:SearchTranslations [13:55:56] Also, this has nothing to do with operations. [13:56:23] Thank you [14:09:19] (03PS1) 10Muehlenhoff: Bugfixes [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/254395 [14:09:44] (03PS2) 10Muehlenhoff: Bugfixes [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/254395 [14:10:02] (03CR) 10Muehlenhoff: [C: 032 V: 032] Bugfixes [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/254395 (owner: 10Muehlenhoff) [14:10:28] (03CR) 10Filippo Giunchedi: [C: 04-1] "this also seems to move graphoid to its own module, but it isn't mentioned anywhere?" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/254372 (owner: 10GWicke) [14:22:37] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [500.0] [14:28:36] PROBLEM - Host mw1041 is DOWN: PING CRITICAL - Packet loss = 100% [14:32:17] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:35:13] !log swift eqiad-prod: set ms-be1019 / ms-be1020 / ms-be1021 weight 3000 [14:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:38:46] 6operations, 6Revscoring, 6Services, 7service-deployment-requests: New Service Request: ORES - https://phabricator.wikimedia.org/T117560#1820537 (10Halfak) [14:43:47] 7Puppet, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 5Patch-For-Review, and 2 others: Puppetize npm/grunt manual setup - https://phabricator.wikimedia.org/T113903#1820545 (10hashar) [14:51:28] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, 10Traffic: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1820567 (10ori) a:5ori>3None [14:51:52] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, 10Traffic: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1820568 (10faidon) p:5High>3Unbreak! [14:55:22] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for ejegg - https://phabricator.wikimedia.org/T118320#1820570 (10Andrew) @Ejegg -- looks good. Next we need a post on this thread from your manager approving your access. 
[14:57:11] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to rest base and cassandra nodes - https://phabricator.wikimedia.org/T117473#1820578 (10Andrew) 5Open>3Resolved [15:03:37] PROBLEM - swift eqiad-prod object availability on graphite1001 is CRITICAL: CRITICAL: 38.78% of data under the critical threshold [90.0] [15:05:52] (03PS2) 10Dereckson: Improve throttle configuration file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253543 [15:05:54] (03PS1) 10Dereckson: Senate House Library throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254405 (https://phabricator.wikimedia.org/T118858) [15:14:24] Hi. There is an event tomorrow in London, they would need a throttle rule but didn't have the right IP previously. Could you merge the no-op 253543 and then deploy this new throttle rule at 254405? ^ [15:14:57] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-314534, 29ms) {#11375} [10Gbps DWDM]BR [15:15:17] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 118, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/1: down - Core: cr1-eqord:xe-0/0/0 (Telia, IC-314534, 24ms) {#10694} [10Gbps DWDM]BR [15:17:01] Dereckson: Can it wait for the next SWAT? [15:17:10] on monday? [15:17:16] Is there not one today? [15:17:20] Bleugh [15:17:22] it's friday! :) [15:17:28] gotta deploy on a friday [15:17:38] I'll get it when phpstorm lets me have access to my pc again :P [15:17:59] Thanks. 
[15:19:28] (03CR) 10Reedy: [C: 032] Improve throttle configuration file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253543 (owner: 10Dereckson) [15:19:48] (03Merged) 10jenkins-bot: Improve throttle configuration file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253543 (owner: 10Dereckson) [15:19:54] (03CR) 10Reedy: [C: 032] Senate House Library throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254405 (https://phabricator.wikimedia.org/T118858) (owner: 10Dereckson) [15:20:18] (03Merged) 10jenkins-bot: Senate House Library throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254405 (https://phabricator.wikimedia.org/T118858) (owner: 10Dereckson) [15:26:43] !log reedy@tin Synchronized wmf-config/throttle.php: Update throttle for event tomorrow (duration: 00m 28s) [15:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:57] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [15:28:17] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [15:28:18] does anyone know if https://www.mediawiki.org/wiki/MediaWiki_1.27/Roadmap is accurate (no train for the next 2 weeks?) [15:28:32] * aude understands none for next week because of holidays but why not the week after? 
[15:30:55] :/ [15:30:59] (Why: First of Dec Fundraising) [15:31:03] says deployment page [15:32:11] <_joe_> !log powercycling mw1041 [15:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:57] RECOVERY - Host mw1041 is UP: PING OK - Packet loss = 0%, RTA = 2.13 ms [15:35:40] (03PS1) 10Chad: Remove myself from udp2log-users, don't need [puppet] - 10https://gerrit.wikimedia.org/r/254408 [15:37:06] 10Ops-Access-Requests, 6operations: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1820618 (10Andrew) analytics-privatedata-users includes shell on stat1002, so that one group should do it. [15:38:05] (03PS1) 10Andrew Bogott: Add jgirault and jdrewniak to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/254409 [15:38:13] (03CR) 10Chad: "Oh yeah, there was some data on oxygen I wanted awhile back...for search? Don't need it anymore..." [puppet] - 10https://gerrit.wikimedia.org/r/254408 (owner: 10Chad) [15:38:42] (03CR) 10GWicke: RESTBase: Update to new specs & enable summary end point (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/254372 (owner: 10GWicke) [15:39:59] 10Ops-Access-Requests, 6operations: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1820633 (10Andrew) p:5Triage>3Normal [15:41:34] !log running sync-common on mw1041 [15:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:42:40] <_joe_> Reedy: don't bother, it's broken [15:44:47] (03PS3) 10GWicke: RESTBase: Update to new specs & enable summary end point [puppet] - 10https://gerrit.wikimedia.org/r/254372 [15:45:05] (03CR) 10GWicke: "@fgiunchedi: Updated the commit message to mention the graphoid tweak." 
[puppet] - 10https://gerrit.wikimedia.org/r/254372 (owner: 10GWicke) [15:45:12] (03PS1) 10Filippo Giunchedi: swift: add swift replication support via swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/254411 [15:45:14] (03PS1) 10Filippo Giunchedi: swift: add role::swift::swiftrepl to ms-fe1001 [puppet] - 10https://gerrit.wikimedia.org/r/254412 [15:45:40] _joe_, reedy there is anything specific to h/w with mw1041. The racadm log states System Software event...could be processor but really ambiguous [15:46:22] it's way out of warranty...may want to consider decommissioning if the error is persistent [15:46:34] (03CR) 10jenkins-bot: [V: 04-1] swift: add swift replication support via swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/254411 (owner: 10Filippo Giunchedi) [15:46:38] (03CR) 10jenkins-bot: [V: 04-1] swift: add role::swift::swiftrepl to ms-fe1001 [puppet] - 10https://gerrit.wikimedia.org/r/254412 (owner: 10Filippo Giunchedi) [15:46:40] <_joe_> cmjohnson1: I see errors in the mcelog [15:47:04] <_joe_> I still didn't dig in further, but yeah, +1 to decom [15:48:44] (03CR) 10Filippo Giunchedi: RESTBase: Update to new specs & enable summary end point (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/254372 (owner: 10GWicke) [15:49:22] (03PS1) 10Muehlenhoff: Add additional YubiKey-backed key for myself [puppet] - 10https://gerrit.wikimedia.org/r/254413 [15:51:18] 6operations, 10ops-eqiad: mw1041 has hardware issues - https://phabricator.wikimedia.org/T119199#1820635 (10Joe) 3NEW [15:51:52] (03CR) 10GWicke: "@Filippo, one new keyspace per group is going to be created, for a total of about 12 cfs." [puppet] - 10https://gerrit.wikimedia.org/r/254372 (owner: 10GWicke) [15:52:17] _joe_: isn't https://phabricator.wikimedia.org/T119199 a duplicate of https://phabricator.wikimedia.org/T118469? 
[15:52:38] (03PS2) 10Filippo Giunchedi: swift: add swift replication support via swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/254411 [15:52:40] (03PS2) 10Filippo Giunchedi: swift: add role::swift::swiftrepl to ms-fe1001 [puppet] - 10https://gerrit.wikimedia.org/r/254412 [15:58:17] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, 10Traffic: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1820653 (10Bawolff) the vhtcpd stats look weird to me ( http://ganglia.wikimedia.org/latest/stacked.php?m=vhtcpd_inpkts_sane&c=... [15:58:24] 6operations, 10ops-eqiad: mw1041 has hardware issues - https://phabricator.wikimedia.org/T119199#1820654 (10Joe) [16:12:32] ori: what we did in beta cluster was a hack sadly. We just had the users that needed to have consistent uids added to ldap. This works *most* of the time but occasionally ldap burps during a puppet run and then we get a local uid that shadows the ldap uid and things break until somebody removes the bad /etc/passwd entry. [16:12:56] I think that we may have fixed it somewhat when we did the apache->www-data changes on the MW servers [16:13:13] The l10nupdate uid isn't on https://wikitech.wikimedia.org/wiki/UID however [16:13:29] so it kind of makes sense it is breaking on the tin->mira sync [16:15:20] <_joe_> bd808: the issue there is that tin was created way before most of the uid fixes we did in the last few years [16:15:52] (03PS1) 10Giuseppe Lavagetto: mediawiki: decommission mw1041 [puppet] - 10https://gerrit.wikimedia.org/r/254417 (https://phabricator.wikimedia.org/T119199) [16:15:59] ah. so puppet would make the l10nupdate uid consistent among multiple hosts today? 
[16:16:28] <_joe_> bd808: I should check of course how l10update gets created [16:16:42] * bd808 is looking for it now [16:16:55] <_joe_> uhm chris just disappeared [16:17:02] <_joe_> !log depooled mw1041 [16:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:17:50] _joe_: confirmed, ::mediawiki::users pins l10nupdate's uid and gid to 10002 so mira is "right" and tin is "wrong" [16:18:38] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:18:44] <_joe_> bd808: I was about to tell you :) [16:20:21] <_joe_> bd808: so we should maybe fix tin, or not [16:20:26] mwdeploy still seems to have a non-specific uid+gid which may cause us issues in the tin<->mira sync at some point [16:20:27] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 8.346 second response time [16:21:11] _joe_: well we need to fix it on tin before the reimage so that we can get mira in the proper state before trying to cut over to using it as master [16:21:18] <_joe_> yup [16:21:40] <_joe_> bd808: we have bigger issues anyway, like trebuchet to move away first [16:21:41] as long as they differ each sync-* on tin will mess up the uids on mira again [16:22:47] trebuchet, ugh. The git remotes will need to be rewritten everywhere I bet [16:23:58] <_joe_> bd808: I know, next week I should work on that too [16:26:16] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:17] _joe_: when you rewrite them, will you use a service name rather than an host name to make the next master change easier? It seems like we don't use CNAMEs here much for that sort of thing [16:27:10] <_joe_> bd808: actually we do, on some newer things [16:27:14] <_joe_> and of course, yes :) [16:27:21] sigh. 
would be nice to have a solution that didn't depend on uuids matching, since that gets messed up in beta all the time, too, but also, it would be nice if we didn't have to run things as root to make that happen. [16:27:49] thcipriani: the beta problem is uids that are only "fixed" in ldap and not in puppet [16:27:50] (which is to say, I don't have an offhand solution) [16:28:31] and the reason for that is hysterical raisins (ie prod has mixed uids for the same users) [16:29:00] *nod* [16:33:38] it seems like trebuchet clobbers all kinds of permissions (because it runs as root) and scap has all sorts of permission problems (because it runs as all kinds of users). Also, puppet is just bad at doing permissions things retroactively after a server has been built. To paraphrase Jay-Z I got 99 problems and some portion of them are permissions errors :\ [16:33:56] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 0.763 second response time [16:34:06] 6operations: mwdeploy does not have the same user ID on all Apaches - https://phabricator.wikimedia.org/T79786#1820768 (10bd808) >>! In T79786#1471163, @fgiunchedi wrote: > is still an issue? `mwdeploy` user has different UIDs across the cluster now, but IMO that shouldn't matter and we should make sure to do ev... [16:34:52] 6operations, 7Swift: add ms-be1019 / 1020 / 1021 to swift - https://phabricator.wikimedia.org/T118183#1820772 (10fgiunchedi) ms-be1019 / 1020 / 1021 have been set to weight 3000, once that is fully rebalanced the following needs to happen: * move 1003 / 1004 from zone 2 to zone 1 * move 1012 from zone 2 to zo... [16:35:48] 6operations, 7Database, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1820777 (10JanZerebecki) A compromised key affects us equivalently as if we were not using encryption nor authentication. We do need to authenticate servers by cert. We may not rely on I... 
[16:36:19] thcipriani: *nod* I think scap does the more correct things personally but the historic use of system assigned uids/gids for what are functionally system accounts has caused problems. Some of those problems were created by switching from deb packages to puppet control for some user accounts. [16:37:19] uid/gid management across a large fleet is not a new problem for the world of sysadmins. [16:37:55] the only place I didn't have problems like this was the network I ran in the olden days that used NIS+ for everything [16:38:05] but then we had NIS+ problems instead [16:39:14] (03PS2) 10Zfilipin: RuboCop: fixed Style/StringLiterals offense [puppet] - 10https://gerrit.wikimedia.org/r/253351 (https://phabricator.wikimedia.org/T112651) [16:40:58] (03PS1) 10Reedy: Add throttle config for Viquimarató nit digital/2015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254425 (https://phabricator.wikimedia.org/T119205) [16:41:12] (03CR) 10Zfilipin: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/253351 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [16:41:34] (03CR) 10Zfilipin: "Patch set 2 fixes conflict in utils/hiera_lookup." 
[puppet] - 10https://gerrit.wikimedia.org/r/253351 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [16:42:02] (03CR) 10Reedy: [C: 032] Add throttle config for Viquimarató nit digital/2015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254425 (https://phabricator.wikimedia.org/T119205) (owner: 10Reedy) [16:42:26] (03Merged) 10jenkins-bot: Add throttle config for Viquimarató nit digital/2015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254425 (https://phabricator.wikimedia.org/T119205) (owner: 10Reedy) [16:43:27] (03Abandoned) 10Zfilipin: RuboCop: fixed Style/StringLiterals offense [puppet] - 10https://gerrit.wikimedia.org/r/253351 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [16:44:43] !log reedy@tin Synchronized wmf-config/throttle.php: Add throttle config for Viquimarató nit digital/2015 (duration: 00m 28s) [16:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:44:54] 6operations, 10Deployment-Systems: uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1820818 (10bd808) [[https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/mediawiki/manifests/users.pp;48c6a219b2b5048662cef1f8638bbe1e232c751f$41|Puppet says]] that l10nupdate... 
[16:46:08] 6operations, 10Deployment-Systems: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1820836 (10bd808) [16:46:29] 6operations, 10Deployment-Systems: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1819881 (10bd808) [16:46:54] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: decommission mw1041 [puppet] - 10https://gerrit.wikimedia.org/r/254417 (https://phabricator.wikimedia.org/T119199) (owner: 10Giuseppe Lavagetto) [16:51:11] <_joe_> !log decommissioning mw1041: removed facts and certs, removed from conftool and pybal [16:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:51:59] 6operations: mwdeploy does not have the same user ID on all Apaches - https://phabricator.wikimedia.org/T79786#1820846 (10fgiunchedi) IIRC rsync as root will try to DTRT and map name/uid on the destination side with what it found on source side (unless `--numeric-ids` is used). Have you seen differently when syn... [16:56:36] 6operations, 6Analytics-Backlog, 7HTTPS: EventLogging sees too few distinct client IPs - https://phabricator.wikimedia.org/T119144#1819140 (10Nuria) Adding @ottomata as we were talking about related ip chnages recently. I am in favour of removing the field entirely even with the awesome way we have to rotat... [16:57:42] (03PS1) 10Rush: WIP: Further hiera-ize role/labs/openstack/ [puppet] - 10https://gerrit.wikimedia.org/r/254426 [16:58:46] (03CR) 10jenkins-bot: [V: 04-1] WIP: Further hiera-ize role/labs/openstack/ [puppet] - 10https://gerrit.wikimedia.org/r/254426 (owner: 10Rush) [17:00:22] 6operations: mwdeploy does not have the same user ID on all Apaches - https://phabricator.wikimedia.org/T79786#1820878 (10bd808) >>! In T79786#1820846, @fgiunchedi wrote: > IIRC rsync as root will try to DTRT and map name/uid on the destination side with what it found on source side (unless `--numeric-ids` is us... 
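(Editor's sketch, not part of the original log.) The uid mismatch discussed above for l10nupdate and mwdeploy can be checked by comparing passwd(5)-style listings from two hosts. This is a minimal illustration, assuming each host's listing is available as text (for example, captured from `getent passwd`). The function names and all sample uid values below are hypothetical and are not taken from tin, mira, or any real host.

```python
def parse_passwd(text):
    """Return a dict mapping user name -> numeric uid from passwd(5)-formatted text."""
    uids = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(":")
        if len(fields) < 3:
            continue  # not a well-formed passwd entry
        uids[fields[0]] = int(fields[2])
    return uids


def uid_mismatches(passwd_a, passwd_b):
    """Return {name: (uid_on_a, uid_on_b)} for users present on both hosts with different uids."""
    a = parse_passwd(passwd_a)
    b = parse_passwd(passwd_b)
    return {
        name: (a[name], b[name])
        for name in sorted(a.keys() & b.keys())
        if a[name] != b[name]
    }


# Hypothetical sample listings; the uid values are invented for illustration.
HOST_A = """\
root:x:0:0:root:/root:/bin/bash
l10nupdate:x:997:997::/home/l10nupdate:/bin/bash
mwdeploy:x:498:498::/var/lib/mwdeploy:/bin/bash
"""

HOST_B = """\
root:x:0:0:root:/root:/bin/bash
l10nupdate:x:10002:10002::/home/l10nupdate:/bin/bash
mwdeploy:x:498:498::/var/lib/mwdeploy:/bin/bash
"""

if __name__ == "__main__":
    for name, (uid_a, uid_b) in uid_mismatches(HOST_A, HOST_B).items():
        print(f"{name}: {uid_a} on host A vs {uid_b} on host B")
```

A name-keyed comparison like this matches the point made in the task comment: a copy that maps owners by name tolerates differing uids, whereas a copy that preserves numeric ids (such as rsync with `--numeric-ids`) does not.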
[17:03:05] (03PS2) 10Rush: WIP: Further hiera-ize role/labs/openstack/ [puppet] - 10https://gerrit.wikimedia.org/r/254426 [17:06:46] 10Ops-Access-Requests, 6operations: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1820898 (10EBernhardson) I was mistaken for the eventlogging access, that happens on stat1003 (and is where t... [17:10:34] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, 10Traffic: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1820900 (10Bawolff) Given the timing of this all, I wonder if it has something to do with moving upload htcp to its own ip ( 29... [17:11:02] (03PS1) 10Zfilipin: RuboCop: regenerated TODO file [puppet] - 10https://gerrit.wikimedia.org/r/254427 [17:11:04] (03PS1) 10Zfilipin: RuboCop: fixed Style/DeprecatedHashMethods offense [puppet] - 10https://gerrit.wikimedia.org/r/254428 (https://phabricator.wikimedia.org/T112651) [17:22:44] Coren, chasemp, I moved our call 30 minutes earlier so as not to conflict with Faidon’s talk. That has us meeting in 10… that ok? [17:23:08] Yeah, that wfm. Ima go grab a quick bite to eat though. 
[17:23:10] All I have to say is, going to try one more round of hiera-ization hopefully you can be around to babysit w/ me [17:23:30] I'm cool w/ meeting but not sure what there is to talk about otherwise [17:23:56] (03PS3) 10Giuseppe Lavagetto: base::certificates: add puppet's CA to the trusted store [puppet] - 10https://gerrit.wikimedia.org/r/252681 (https://phabricator.wikimedia.org/T114638) [17:23:57] (03PS2) 10Giuseppe Lavagetto: etcd: switch to using the system-wide puppet ca [puppet] - 10https://gerrit.wikimedia.org/r/243663 (https://phabricator.wikimedia.org/T114638) [17:23:59] (03PS2) 10Giuseppe Lavagetto: k8s: switch to using systems' CA [puppet] - 10https://gerrit.wikimedia.org/r/243662 (https://phabricator.wikimedia.org/T114638) [17:24:01] (03PS2) 10Giuseppe Lavagetto: toolschecker: switch to using base::puppet::ca [puppet] - 10https://gerrit.wikimedia.org/r/243666 (https://phabricator.wikimedia.org/T114638) [17:24:03] (03PS2) 10Giuseppe Lavagetto: conftool: switch to using base::puppet::ca [puppet] - 10https://gerrit.wikimedia.org/r/243664 (https://phabricator.wikimedia.org/T114638) [17:24:05] (03PS2) 10Giuseppe Lavagetto: eventlogging: switch to using base::puppet::ca [puppet] - 10https://gerrit.wikimedia.org/r/243665 (https://phabricator.wikimedia.org/T114638) [17:24:51] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: puppet fail [17:25:04] (03PS3) 10Rush: WIP: Further hiera-ize role/labs/openstack/ [puppet] - 10https://gerrit.wikimedia.org/r/254426 [17:30:21] (03Abandoned) 10Giuseppe Lavagetto: Add class base::puppet::ca [puppet] - 10https://gerrit.wikimedia.org/r/243661 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [17:32:22] (03PS1) 10RobH: updating star.planet.wikimedia.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/254431 [17:36:07] PROBLEM - mediawiki-installation DSH group on mw1041 is CRITICAL: Host mw1041 is not in mediawiki-installation dsh group [17:40:38] _joe_: ^^ did you remove it? 
[17:40:47] /fully [17:40:48] (03CR) 10Zfilipin: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/254428 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [17:40:51] (03CR) 10Zfilipin: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/254427 (owner: 10Zfilipin) [17:41:10] <_joe_> Reedy: I did, probably still didn't get distributed everywhere [17:46:14] (03CR) 10Andrew Bogott: "If the compiled is happy then I'm happy." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/254426 (owner: 10Rush) [17:51:47] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:53:28] (03PS4) 10Rush: WIP: Further hiera-ize role/labs/openstack/ [puppet] - 10https://gerrit.wikimedia.org/r/254426 [17:53:33] (03CR) 10Dzahn: [C: 032] "Moritz sent me a GPG signed statement that this key belongs to him and it had a good signature :)" [puppet] - 10https://gerrit.wikimedia.org/r/254413 (owner: 10Muehlenhoff) [17:54:51] (03PS2) 10Dzahn: Add additional YubiKey-backed key for myself [puppet] - 10https://gerrit.wikimedia.org/r/254413 (owner: 10Muehlenhoff) [17:59:29] (03PS5) 10Rush: WIP: Further hiera-ize role/labs/openstack/ [puppet] - 10https://gerrit.wikimedia.org/r/254426 [18:14:50] (03PS4) 10coren: toollabs: make sure /tmp and swap are large for all exec hosts [puppet] - 10https://gerrit.wikimedia.org/r/252506 (https://phabricator.wikimedia.org/T118419) (owner: 10Merlijn van Deen) [18:16:36] (03CR) 10coren: [C: 032] "Merging; puppet is disabled on all the instances this should affect." 
[puppet] - 10https://gerrit.wikimedia.org/r/252506 (https://phabricator.wikimedia.org/T118419) (owner: 10Merlijn van Deen) [18:34:28] (03PS1) 10Luke081515: Enable rollbacker and patroller group at maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254443 (https://phabricator.wikimedia.org/T118934) [18:35:26] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [18:35:46] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [18:36:00] (03PS2) 10Luke081515: Enable rollbacker and patroller group at maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254443 (https://phabricator.wikimedia.org/T118934) [18:37:37] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [18:39:07] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [18:42:44] (03PS3) 10Luke081515: Add new group "curator" to enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252012 (https://phabricator.wikimedia.org/T113109) [18:44:33] Dereckson: Problem with unmerged changes in gerrit 252012 should be ok now ;) [18:46:38] (03PS1) 10coren: Tools: can't actually use ../init.pp in a subdirectory [puppet] - 10https://gerrit.wikimedia.org/r/254447 [18:46:47] andrewbogott: Quick review of ^^ for a simple fix? [18:48:16] (03CR) 10Andrew Bogott: [C: 031] Tools: can't actually use ../init.pp in a subdirectory [puppet] - 10https://gerrit.wikimedia.org/r/254447 (owner: 10coren) [18:48:41] (03CR) 10coren: [C: 032] "*grumble grumble* puppet *grumble*" [puppet] - 10https://gerrit.wikimedia.org/r/254447 (owner: 10coren) [18:52:19] Hi Luke081515. [18:52:28] Hi :) [18:54:44] Luke081515: git review -d 252012 ; git rebase master -> Yes, it's mergeable and good. 
[18:55:13] !log resuming interrupted arbcomlist invocation on terbium [18:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:55:38] (03CR) 10Dereckson: [C: 031] Add new group "curator" to enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252012 (https://phabricator.wikimedia.org/T113109) (owner: 10Luke081515) [18:56:02] Dereckson: Thanks :) [18:59:47] 6operations: 2FA for SSH access to the production cluster - https://phabricator.wikimedia.org/T116750#1821283 (10JanZerebecki) It seems like https://shop.nitrokey.com/shop is a viable fully free hardware+software alternative to Yubikey NEO with a similar price tag. (Their businesses address is 5 underground stat... [19:05:29] 6operations, 6Analytics-Backlog, 7HTTPS: EventLogging sees too few distinct client IPs - https://phabricator.wikimedia.org/T119144#1821300 (10leila) @Nuria, which field are you referring to? clientIP in EL tables? If so, let's chat about it before removing that field since part of the research we are doing r... [19:07:34] 6operations, 6Analytics-Backlog, 7HTTPS: EventLogging sees too few distinct client IPs - https://phabricator.wikimedia.org/T119144#1821330 (10Nuria) @leila: Understood but see comments about this being broken on several tables since 20150616. [19:08:51] 6operations: Update node_js to latest 0.10.x release - https://phabricator.wikimedia.org/T119218#1821338 (10hashar) [19:09:20] 6operations, 6Services: Update node_js to latest 0.10.x release - https://phabricator.wikimedia.org/T119218#1821188 (10hashar) + #Services since they have a bunch of nodejs daemon. 
[19:16:08] 7Blocked-on-Operations, 6operations, 10RESTBase, 6Services: Switch RESTBase to use Node.js 4 - https://phabricator.wikimedia.org/T107762#1821382 (10GWicke) p:5Normal>3High [19:17:33] 6operations, 6Services: Update node_js to latest 0.10.x release - https://phabricator.wikimedia.org/T119218#1821395 (10GWicke) For RB, we are looking into the current LTS 4.2 instead: T107762 I just bumped the priority on that one, and hope that we can start the gradual migration soon. It will likely be a Jes... [19:27:25] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 383 bytes in 0.047 second response time [19:27:32] 6operations, 6Services: Update node_js to latest 0.10.x release - https://phabricator.wikimedia.org/T119218#1821415 (10cscott) There have been a bunch of security patches for the node 0.10.x series, for example 0.10.37 was a security release: http://dailyjs.com/2015/03/18/1399-node-roundup/ For that reason I'... [19:27:41] what's going on with tools? [19:28:20] nothing in sal to denote it should be down [19:29:25] 7Blocked-on-Operations, 6operations, 10RESTBase, 6Services: Switch RESTBase to use Node.js 4 - https://phabricator.wikimedia.org/T107762#1821437 (10cscott) [19:29:49] coren is investigating the tools alert [19:31:43] 7Blocked-on-Operations, 6operations, 10RESTBase, 6Services: Switch RESTBase to use Node.js 4 - https://phabricator.wikimedia.org/T107762#1821444 (10cscott) Added a similar task for Parsoid (T119228), blocked by this one. Once RESTBase switches we'll want to switch shortly afterwards, but we don't want to... [19:35:23] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 930109 bytes in 3.630 second response time [19:36:14] 6operations, 6Analytics-Backlog, 7HTTPS: EventLogging sees too few distinct client IPs - https://phabricator.wikimedia.org/T119144#1821451 (10ori) >>! 
In T119144#1821300, @leila wrote: > @Nuria, which field are you referring to? clientIP in EL tables? If so, let's chat about it before removing that field sin... [19:36:35] (03CR) 10Krinkle: graphite: Clarify description of graphite_threshold for reqstats.5xx (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/252584 (owner: 10Krinkle) [19:36:39] (03PS3) 10Krinkle: graphite: Clarify description of graphite_threshold for reqstats.5xx [puppet] - 10https://gerrit.wikimedia.org/r/252584 [19:45:24] (03PS1) 10coren: Labs: fix ordering of creation for lvm_volumes [puppet] - 10https://gerrit.wikimedia.org/r/254455 [19:45:31] YuviPanda: can you advise re: paramiko [19:45:36] YuviPanda: ^^ the issue was more subtle than first appears. [19:46:34] (We didn't forget to set sticky, the code that does it was subtly broken) [19:47:33] looking [19:48:23] The old manifest would work - but only on the second puppet run, leaving at least 30 minutes of brokenness. [19:48:33] (or 20) [19:52:33] 6operations, 10Wikimedia-SVG-rendering, 7Upstream: Filter effect Gaussian blur filter not rendered correctly for small to medium thumbnail sizes - https://phabricator.wikimedia.org/T44090#1821501 (10Dvorapa) Another impressive example: - https://cs.wikipedia.org/wiki/Wikipedista:Dvorapa/P%C3%ADskovi%C5%A1... [19:53:03] Coren: can you put in a comment block explaining this *in* the code itself? [19:53:19] andrewbogott: so... paramiko is still the ssh implementation in python and unfortunately I don't think it's been updated to support new stuff yet [19:53:42] YuviPanda: I just now updated it to the latest, but the latest still doesn’t do what I need. [19:53:48] YuviPanda: any idea what to use instead? [19:53:52] just shelling out to ssh? 
[19:54:08] andrewbogott: yeah that's what scap3 does [19:54:11] It used to work for cert-cleaning but something must’ve changed with the ssh version [19:54:16] ok, damn [19:54:17] thanks [19:54:34] 6operations, 6Labs, 10Labs-Infrastructure: rename holmium to labservices1002 - https://phabricator.wikimedia.org/T106303#1821505 (10Andrew) a:3Andrew [19:54:58] 6operations: Weird message from the Facebook team to list admins - https://phabricator.wikimedia.org/T119232#1821507 (10Selsharbaty-WMF) 3NEW [19:55:19] (03PS2) 10coren: Labs: fix ordering of creation for lvm_volumes [puppet] - 10https://gerrit.wikimedia.org/r/254455 [19:55:20] YuviPanda: {{done}} ^^ [19:56:08] (03CR) 10Yuvipanda: [C: 031] Labs: fix ordering of creation for lvm_volumes [puppet] - 10https://gerrit.wikimedia.org/r/254455 (owner: 10coren) [19:57:08] (03CR) 10coren: [C: 032] Labs: fix ordering of creation for lvm_volumes [puppet] - 10https://gerrit.wikimedia.org/r/254455 (owner: 10coren) [20:06:11] (03PS1) 10Andrew Bogott: Rename holmium to labservices1002. [dns] - 10https://gerrit.wikimedia.org/r/254463 (https://phabricator.wikimedia.org/T106303) [20:06:51] (03PS1) 10Andrew Bogott: Switch primary designate host to labservices1001. [puppet] - 10https://gerrit.wikimedia.org/r/254464 (https://phabricator.wikimedia.org/T106303) [20:06:53] (03PS1) 10Andrew Bogott: Rename holmium to labservices1002. [puppet] - 10https://gerrit.wikimedia.org/r/254465 [20:07:08] (03PS2) 10Andrew Bogott: Rename holmium to labservices1002. [dns] - 10https://gerrit.wikimedia.org/r/254463 (https://phabricator.wikimedia.org/T106303) [20:08:52] (03PS3) 10Andrew Bogott: Rename holmium to labservices1002. [dns] - 10https://gerrit.wikimedia.org/r/254463 (https://phabricator.wikimedia.org/T106303) [20:09:01] (03PS2) 10Andrew Bogott: Rename holmium to labservices1002. [puppet] - 10https://gerrit.wikimedia.org/r/254465 [20:09:54] ... 
[20:10:03] /home/dereckson/dev/mediawiki/operations/mediawiki-config/w/static/images/project-logos ] rm ladwiki.png [20:10:06] /home/dereckson/dev/mediawiki/operations/mediawiki-config/w/static/images/project-logos ] arc download F2972988 [20:10:09] We don't have an .arcconfig in this repo :( [20:14:13] (03PS2) 10Rush: Switch primary designate host to labservices1001. [puppet] - 10https://gerrit.wikimedia.org/r/254464 (https://phabricator.wikimedia.org/T106303) (owner: 10Andrew Bogott) [20:14:21] (03CR) 10Rush: [C: 031] "makes sense" [puppet] - 10https://gerrit.wikimedia.org/r/254464 (https://phabricator.wikimedia.org/T106303) (owner: 10Andrew Bogott) [20:14:45] (03CR) 10Rush: "http://puppet-compiler.wmflabs.org/1336/" [puppet] - 10https://gerrit.wikimedia.org/r/254426 (owner: 10Rush) [20:15:15] 6operations, 7Icinga: icinga-wm not outputing messages for alerts that also paged and are dba-related - https://phabricator.wikimedia.org/T118072#1821625 (10Dzahn) [20:16:52] (03PS1) 10Dereckson: Arcanist configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254468 [20:17:03] (03CR) 10Andrew Bogott: [C: 032] Switch primary designate host to labservices1001. [puppet] - 10https://gerrit.wikimedia.org/r/254464 (https://phabricator.wikimedia.org/T106303) (owner: 10Andrew Bogott) [20:20:33] (03PS8) 10Yuvipanda: quarry: Move to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253531 [20:21:29] (03CR) 10Dereckson: "This allows to tell Arcanist, the phabricator CLI client to interact with our instance." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/254468 (owner: 10Dereckson) [20:21:31] (03PS6) 10Rush: Further hiera-ize role/labs/openstack/ [puppet] - 10https://gerrit.wikimedia.org/r/254426 [20:22:18] (03PS7) 10Rush: Further hiera-ize role/labs/openstack/ [puppet] - 10https://gerrit.wikimedia.org/r/254426 [20:22:21] (03CR) 10Yuvipanda: [C: 032] quarry: Move to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253531 (owner: 10Yuvipanda) [20:22:47] PROBLEM - designate-pool-manager process on holmium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-pool-manager [20:22:54] (03PS1) 10Dereckson: Logo update for lad.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254471 (https://phabricator.wikimedia.org/T118491) [20:22:56] PROBLEM - designate-central process on holmium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-central [20:22:57] andrewbogott: ^ [20:23:06] PROBLEM - designate-api process on holmium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-api [20:23:18] PROBLEM - designate-sink process on holmium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-sink [20:23:24] ok, fixing... 
[20:23:37] PROBLEM - designate-mdns process on holmium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-mdns [20:24:30] I should’ve seen that one coming [20:29:23] (03PS6) 10Yuvipanda: ores: Move to using redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/254119 [20:38:09] (03CR) 10Yuvipanda: [C: 032] ores: Move to using redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/254119 (owner: 10Yuvipanda) [20:41:15] (03PS1) 10Dereckson: Enable subpages on custom aliases from 112 to 119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254475 [20:41:15] (03PS1) 10Dereckson: Namespace configuration for en.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254476 [20:43:16] (03PS2) 10Dereckson: Namespace configuration for en.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254476 (https://phabricator.wikimedia.org/T119207) [20:51:34] (03PS3) 10Yuvipanda: labs: Cleanup role classes (part #1) [puppet] - 10https://gerrit.wikimedia.org/r/254124 [20:51:44] (03PS1) 10Andrew Bogott: Revert "Switch primary designate host to labservices1001." [puppet] - 10https://gerrit.wikimedia.org/r/254481 [20:51:56] (03PS2) 10Andrew Bogott: Revert "Switch primary designate host to labservices1001." [puppet] - 10https://gerrit.wikimedia.org/r/254481 [20:52:04] (03PS2) 10Ori.livneh: redis::instance: support hash configuration values [puppet] - 10https://gerrit.wikimedia.org/r/254327 [20:52:12] (03CR) 10Ori.livneh: [C: 032 V: 032] redis::instance: support hash configuration values [puppet] - 10https://gerrit.wikimedia.org/r/254327 (owner: 10Ori.livneh) [20:53:25] ori: I only have one thing left to move (tools-redis) and I might take this opportunity to move it to jessie [20:54:30] (03PS3) 10Andrew Bogott: Revert "Switch primary designate host to labservices1001." 
[puppet] - 10https://gerrit.wikimedia.org/r/254481 [20:55:40] (03CR) 10Andrew Bogott: [C: 032] Revert "Switch primary designate host to labservices1001." [puppet] - 10https://gerrit.wikimedia.org/r/254481 (owner: 10Andrew Bogott) [20:56:32] (03PS1) 10Dereckson: Rights configuration on fa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254486 (https://phabricator.wikimedia.org/T118847) [20:57:07] (03CR) 10jenkins-bot: [V: 04-1] Rights configuration on fa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254486 (https://phabricator.wikimedia.org/T118847) (owner: 10Dereckson) [20:58:57] (03PS2) 10Dereckson: Rights configuration on fa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254486 (https://phabricator.wikimedia.org/T118847) [21:00:16] (03PS4) 10Yuvipanda: labs: Cleanup role classes (part #1) [puppet] - 10https://gerrit.wikimedia.org/r/254124 [21:03:06] (03PS5) 10Yuvipanda: labs: Cleanup and move role classes (part #1) [puppet] - 10https://gerrit.wikimedia.org/r/254124 [21:04:17] (03PS3) 10Andrew Bogott: Rename holmium to labservices1002. [puppet] - 10https://gerrit.wikimedia.org/r/254465 [21:04:19] (03PS1) 10Andrew Bogott: Switch primary designate host to labservices1001. [puppet] - 10https://gerrit.wikimedia.org/r/254489 (https://phabricator.wikimedia.org/T106303) [21:05:14] (03PS4) 10Andrew Bogott: Rename holmium to labservices1002. [dns] - 10https://gerrit.wikimedia.org/r/254463 (https://phabricator.wikimedia.org/T106303) [21:05:38] (03PS4) 10Andrew Bogott: Rename holmium to labservices1002. 
[puppet] - 10https://gerrit.wikimedia.org/r/254465 [21:06:09] (03CR) 10Andrew Bogott: [C: 04-1] "this broke something, and I don't yet know what" [puppet] - 10https://gerrit.wikimedia.org/r/254489 (https://phabricator.wikimedia.org/T106303) (owner: 10Andrew Bogott) [21:06:45] (03CR) 10Yuvipanda: [C: 032 V: 032] "QUICK QUICK QUICK QUICK" [puppet] - 10https://gerrit.wikimedia.org/r/254124 (owner: 10Yuvipanda) [21:17:02] (03PS1) 10MaxSem: WIP: OSM replication for maps [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) [21:18:41] (03CR) 10jenkins-bot: [V: 04-1] WIP: OSM replication for maps [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) (owner: 10MaxSem) [21:23:15] (03PS8) 10Rush: Further hiera-ize role/labs/openstack/ [puppet] - 10https://gerrit.wikimedia.org/r/254426 [21:24:13] (03CR) 10jenkins-bot: [V: 04-1] Further hiera-ize role/labs/openstack/ [puppet] - 10https://gerrit.wikimedia.org/r/254426 (owner: 10Rush) [21:25:44] (03PS1) 10Yuvipanda: Stop importing manifests/role/labs/* [puppet] - 10https://gerrit.wikimedia.org/r/254493 [21:33:36] (03PS9) 10Rush: Further hiera-ize role/labs/openstack/ [puppet] - 10https://gerrit.wikimedia.org/r/254426 [21:34:29] (03PS1) 10Yuvipanda: labs: Rename and move DNS roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/254495 [21:35:10] chasemp: heh, ^ conflicts too [21:35:18] I'll back off and let you finish (whenever) [21:35:19] yeah :) [21:35:29] I can do my moving around later [21:35:43] chasemp: heh, just saw this on the other channel :) [21:36:04] (03PS2) 10MaxSem: WIP: OSM replication for maps [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) [21:37:29] (03CR) 10jenkins-bot: [V: 04-1] WIP: OSM replication for maps [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) (owner: 10MaxSem) [21:37:39] <3 jerkins [21:37:41] 
10Ops-Access-Requests, 6operations: Requesting access to stat1002 for ejegg - https://phabricator.wikimedia.org/T118320#1821857 (10Ejegg) a:5Ejegg>3Wwes Hi Wes! @K4-713 is out of town for a while, and the Fundraising team is eager to have an in-house way of tracking email clicks (see T114010). I need man... [21:37:47] (03PS1) 10Smalyshev: Support /sparql as an endpoint [puppet] - 10https://gerrit.wikimedia.org/r/254497 (https://phabricator.wikimedia.org/T119081) [21:37:49] (03CR) 10Rush: [C: 032] Further hiera-ize role/labs/openstack/ [puppet] - 10https://gerrit.wikimedia.org/r/254426 (owner: 10Rush) [21:40:29] (03PS3) 10MaxSem: WIP: OSM replication for maps [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) [21:45:15] !log csteipp@tin Synchronized php-1.27.0-wmf.7/includes/User.php: Deploy fix for T119021 (duration: 00m 28s) [21:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:02:49] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: puppet fail [22:04:06] * andrewbogott looks at ^ [22:04:15] yeah that's my stuff I think [22:04:24] second try worked fine [22:04:40] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [22:04:52] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for ejegg - https://phabricator.wikimedia.org/T118320#1821911 (10Wwes) Approved [22:05:27] 6operations: apt-get update partial failure lots of places - https://phabricator.wikimedia.org/T119242#1821912 (10Andrew) 3NEW [22:07:30] PROBLEM - Recursive DNS on 208.80.155.118 is CRITICAL: CRITICAL - Plugin timed out while executing system call [22:08:12] 6operations: apt-get update partial failure lots of places - https://phabricator.wikimedia.org/T119242#1821921 (10Andrew) This seems to be a known issue with an apt race condition: 
https://askubuntu.com/questions/41605/trouble-downloading-packages-list-due-to-a-hash-sum-mismatch-error [22:08:54] labs-recursor1 [22:09:19] RECOVERY - Recursive DNS on 208.80.155.118 is OK: DNS OK: 0.186 seconds response time. www.wikipedia.org returns 208.80.154.224 [22:10:39] 6operations, 10Deployment-Systems: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1821924 (10bd808) These mismatches are only really a problem when root gets involved. It doesn't matter what the uid/gids are **until** you start using rsync/tar/whatever as root to... [22:25:27] (03PS1) 10Rush: labsdns::recursor use top level scope for ::network [puppet] - 10https://gerrit.wikimedia.org/r/254581 [22:25:44] (03PS2) 10Rush: labsdns::recursor use top level scope for ::network [puppet] - 10https://gerrit.wikimedia.org/r/254581 [22:25:50] (03CR) 10jenkins-bot: [V: 04-1] labsdns::recursor use top level scope for ::network [puppet] - 10https://gerrit.wikimedia.org/r/254581 (owner: 10Rush) [22:31:24] (03PS3) 10Rush: labsdns::recursor use top level scope for ::network [puppet] - 10https://gerrit.wikimedia.org/r/254581 [22:31:47] (03CR) 10jenkins-bot: [V: 04-1] labsdns::recursor use top level scope for ::network [puppet] - 10https://gerrit.wikimedia.org/r/254581 (owner: 10Rush) [22:32:02] (03PS4) 10Rush: labsdns::recursor use top level scope for ::network [puppet] - 10https://gerrit.wikimedia.org/r/254581 [22:33:51] (03CR) 10Rush: [C: 032] labsdns::recursor use top level scope for ::network [puppet] - 10https://gerrit.wikimedia.org/r/254581 (owner: 10Rush) [22:40:01] (03PS1) 10Andrew Bogott: Replace ::all_networks that was dropped in a previous patch [puppet] - 10https://gerrit.wikimedia.org/r/254586 [22:40:02] chasemp: is it just ^ ? 
[22:40:34] ah let's try it [22:40:49] (03CR) 10Rush: [C: 032] Replace ::all_networks that was dropped in a previous patch [puppet] - 10https://gerrit.wikimedia.org/r/254586 (owner: 10Andrew Bogott) [22:40:56] (03CR) 10Rush: [V: 032] Replace ::all_networks that was dropped in a previous patch [puppet] - 10https://gerrit.wikimedia.org/r/254586 (owner: 10Andrew Bogott) [22:41:20] (03CR) 10Andrew Bogott: Further hiera-ize role/labs/openstack/ (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/254426 (owner: 10Rush) [22:42:24] andrewbogott: thanks it was just a stupid thing then :) [22:42:40] fixed? [22:43:30] yeah that did it [22:43:38] on to the next one [22:44:15] chasemp: am I clear to do my moves (hah!) or do you want me to hold off till monday? [22:44:17] * YuviPanda can do [22:44:35] YuviPanda: still rolling through a few to enable puppet and watching, assuming all is well I need maybe 10 minutes? [22:46:05] (03PS1) 10Dzahn: racktables: remove role from magnesium [puppet] - 10https://gerrit.wikimedia.org/r/254588 [22:47:32] chasemp: sure! 
np [22:47:35] I'm in no hurry [22:47:35] (03PS2) 10Dzahn: racktables: remove role from magnesium [puppet] - 10https://gerrit.wikimedia.org/r/254588 (https://phabricator.wikimedia.org/T105555) [22:48:31] (03PS3) 10Dzahn: racktables: remove role from magnesium [puppet] - 10https://gerrit.wikimedia.org/r/254588 (https://phabricator.wikimedia.org/T105555) [22:48:44] (03CR) 10Dzahn: [C: 032] racktables: remove role from magnesium [puppet] - 10https://gerrit.wikimedia.org/r/254588 (https://phabricator.wikimedia.org/T105555) (owner: 10Dzahn) [22:53:56] (03CR) 10Smalyshev: [C: 031] WDQS Also use queryStartCount counter [puppet] - 10https://gerrit.wikimedia.org/r/254378 (https://phabricator.wikimedia.org/T119178) (owner: 10Addshore) [22:59:56] (03CR) 10Luke081515: [C: 031] Rights configuration on fa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254486 (https://phabricator.wikimedia.org/T118847) (owner: 10Dereckson) [23:01:48] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#1822041 (10DStrine) The security task is closed. Is this still stalled and #blocked-by-operations ? What would un-stall it? Should it be as... [23:09:55] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#1822062 (10chasemp) We need to make a plan to get connectivity through to the end host for this. This will probably fall on operations yes b... 
[23:11:02] YuviPanda: all seems clear [23:12:44] chasemp: cool [23:15:11] YuviPanda: role/labs/openstack reduced from 859 to 480 lines [23:15:14] :) [23:15:22] \o/ [23:15:25] \o/\o/ [23:15:27] yay for cleanup [23:17:26] YuviPanda: i have another small one that'll make you twitch [23:17:44] `base::standard-packages` [23:17:47] Twitch Cleans up Puppet [23:17:48] should be _ [23:17:49] aawrgh [23:18:00] it should probably be rolled into something else [23:19:39] YuviPanda: we'll end up installing saltstack :P [23:20:03] (Re: Twitch Cleans up Puppet) [23:23:30] heh [23:44:59] (03PS4) 10Ori.livneh: graphite: Clarify description of graphite_threshold for reqstats.5xx [puppet] - 10https://gerrit.wikimedia.org/r/252584 (owner: 10Krinkle) [23:45:22] (03CR) 10Ori.livneh: [C: 032 V: 032] graphite: Clarify description of graphite_threshold for reqstats.5xx [puppet] - 10https://gerrit.wikimedia.org/r/252584 (owner: 10Krinkle) [23:55:01] 6operations, 6Reading-Admin: Improve UX Strategic Test - https://phabricator.wikimedia.org/T117826#1822249 (10dr0ptp4kt)