[00:00:05] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170615T0000).
[00:02:22] !log Deploying phabricator update (tagged release/2017-06-14/1) details: https://phabricator.wikimedia.org/project/view/2831/
[00:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:58] (PS1) Dzahn: add releases1001.eqiad.wmnet to site.pp [puppet] - https://gerrit.wikimedia.org/r/359080 (https://phabricator.wikimedia.org/T164030)
[00:10:55] Operations, DNS, Traffic: en.wiki domain owned by us, but isn't hosted by us?? - https://phabricator.wikimedia.org/T167060#3350353 (Bawolff) Hmm. W.wiki seems to be similar in that we own the domain but the A record points to AWS. It has the additional interesting thing in that its included in the su...
[00:12:01] RECOVERY - Check systemd state on dumpsdata1001 is OK: OK - running: The system is fully operational
[00:15:27] !log dumpsdata1001 - was reported in icinga as CRIT systemdstate - reason was puppet service was failed with "Invalid value '"no"' for boolean parameter: daemonize" (it was ok on other hosts??). commented the option, stopped puppet, systemctl reset-failed - which made it recover (T165368)
[00:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:36] T165368: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368
[00:17:32] Operations, Traffic, Wikimedia-General-or-Unknown: Impending load test - https://phabricator.wikimedia.org/T167920#3350376 (BBlack) >>! In T167920#3349772, @Haiku-narrative wrote: > we actually haven't been given the numbers yet, but I expect us to handle a few million requests over the course of a c...
[00:18:11] Operations, DNS, Traffic: en.wiki domain owned by us, but isn't hosted by us?? - https://phabricator.wikimedia.org/T167060#3350377 (Bawolff) Oh, i see, the A record points to a page which is hosted by the registrar for .wiki (who happens to use aws- i didnt originally think of navigating to 54.148.61...
[00:26:57] (PS1) Dzahn: base/puppet: use "false" instead of "no" with "daemonize" option [puppet] - https://gerrit.wikimedia.org/r/359084
[00:28:31] RECOVERY - Restbase root url on restbase2001 is OK: HTTP OK: HTTP/1.1 200 - 844 bytes in 0.077 second response time
[00:37:44] (CR) Dzahn: [C: 2] add releases1001.eqiad.wmnet to site.pp [puppet] - https://gerrit.wikimedia.org/r/359080 (https://phabricator.wikimedia.org/T164030) (owner: Dzahn)
[00:39:09] (CR) Dzahn: [C: -1] "let's put it on "releases1001" instead" [puppet] - https://gerrit.wikimedia.org/r/359029 (owner: Chad)
[00:40:43] (CR) Dzahn: [C: -1] "let's make a minimal role, that includes 2 profiles, one for mwreleases and (later) one for what is currently on bromine. then put jenkins" [puppet] - https://gerrit.wikimedia.org/r/359029 (owner: Chad)
[00:41:44] (CR) jenkins-bot: Add atjwiki meta namespace talk [mediawiki-config] - https://gerrit.wikimedia.org/r/359066 (https://phabricator.wikimedia.org/T167714) (owner: Reedy)
[00:41:47] (PS7) Dzahn: Ircecho: Fix bot not to use carriage returns [puppet] - https://gerrit.wikimedia.org/r/359025 (owner: Paladox)
[00:42:40] (CR) jenkins-bot: Remove duplicate config from CommonSettings-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/358639 (owner: Reedy)
[00:43:49] (CR) jenkins-bot: Cleanup ORES config: Drop wgOresExtensionStatus (default) [mediawiki-config] - https://gerrit.wikimedia.org/r/354732 (owner: Jforrester)
[00:45:55] (CR) jenkins-bot: Update interwiki.php, at atkwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/359054 (https://phabricator.wikimedia.org/T167714) (owner: Reedy)
[00:46:41] (CR) jenkins-bot: Revert "Test elastic2020 does not fall out of cluster" [mediawiki-config] - https://gerrit.wikimedia.org/r/359007 (https://phabricator.wikimedia.org/T149006) (owner: Gehel)
[00:46:58] (PS1) Dzahn: install_server: update MAC address of releases1001 [puppet] - https://gerrit.wikimedia.org/r/359086 (https://phabricator.wikimedia.org/T164030)
[00:47:29] (CR) jenkins-bot: Enable $wgStructuredChangeEnableExperimentalViews in beta labs [mediawiki-config] - https://gerrit.wikimedia.org/r/359071 (https://phabricator.wikimedia.org/T164130) (owner: Catrope)
[00:47:40] (PS2) Dzahn: install_server: update MAC address of releases1001 [puppet] - https://gerrit.wikimedia.org/r/359086 (https://phabricator.wikimedia.org/T164030)
[00:48:03] (CR) Dzahn: [C: 2] install_server: update MAC address of releases1001 [puppet] - https://gerrit.wikimedia.org/r/359086 (https://phabricator.wikimedia.org/T164030) (owner: Dzahn)
[00:48:22] (CR) jenkins-bot: Remove Linter from extension-list-labs, redundant to production [mediawiki-config] - https://gerrit.wikimedia.org/r/359023 (owner: Chad)
[00:48:58] (CR) jenkins-bot: Remove LoginNotify from extension-list-labs, redundant to production [mediawiki-config] - https://gerrit.wikimedia.org/r/359024 (owner: Chad)
[00:49:45] (CR) jenkins-bot: Promote CollaborationKit to the big leagues; deploy on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343697 (https://phabricator.wikimedia.org/T138326) (owner: Reedy)
[00:50:15] (CR) jenkins-bot: Add atjwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/359035 (https://phabricator.wikimedia.org/T167714) (owner: Reedy)
[00:51:13] (CR) jenkins-bot: Cleanup ORES config: Alphasort wmgUseORES [mediawiki-config] - https://gerrit.wikimedia.org/r/359075 (owner: Jforrester)
[01:10:50] (PS1) Dzahn: install_server: switch releases1001 to stretch [puppet] - https://gerrit.wikimedia.org/r/359087 (https://phabricator.wikimedia.org/T164030)
[01:11:33] (PS2) Dzahn: install_server: switch releases1001 to stretch [puppet] - https://gerrit.wikimedia.org/r/359087 (https://phabricator.wikimedia.org/T164030)
[01:12:55] (CR) Dzahn: [C: 2] install_server: switch releases1001 to stretch [puppet] - https://gerrit.wikimedia.org/r/359087 (https://phabricator.wikimedia.org/T164030) (owner: Dzahn)
[01:16:33] (CR) Dzahn: "where could we test it?" [puppet] - https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) (owner: Paladox)
[01:17:56] !log releases1001 - reinstalling with stretch
[01:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:31:51] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:32:03] ^ that was me, it's ok :)
[01:32:51] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[01:37:37] Operations, Patch-For-Review, Release-Engineering-Team (Kanban), Security-General: setup releases1001.eqiad.wmnet (was: setup mwreleases1001) - https://phabricator.wikimedia.org/T164030#3350488 (Dzahn) reinstalled as releases1001, with stretch. the "releasers-mediawiki" group has shell (again). w...
[01:40:41] PROBLEM - Disk space on ms-be1010 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdk1 is not accessible: Input/output error
[01:50:06] (CR) Dzahn: "https://docs.puppet.com/puppet/latest/configuration.html#daemonize" [puppet] - https://gerrit.wikimedia.org/r/359084 (owner: Dzahn)
[01:51:12] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[mountpoint-/srv/swift-storage/sdk1]
[01:54:22] PROBLEM - Check systemd state on releases1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:01:42] RECOVERY - Disk space on ms-be1010 is OK: DISK OK
[02:24:22] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[02:26:12] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.4) (duration: 09m 15s)
[02:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:39:27] (PS1) Dzahn: releases: add new role/profile, add backups, install jenkins [puppet] - https://gerrit.wikimedia.org/r/359089 (https://phabricator.wikimedia.org/T164030)
[02:43:28] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.5) (duration: 07m 34s)
[02:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:45:13] PROBLEM - Disk space on ms-be1011 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sda1 is not accessible: Input/output error
[02:50:16] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Jun 15 02:50:16 UTC 2017 (duration 6m 48s)
[02:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:01:12] RECOVERY - Disk space on ms-be1011 is OK: DISK OK
[03:02:02] PROBLEM - puppet last run on ms-be1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[mountpoint-/srv/swift-storage/sda1]
[03:05:32] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Service[varnish]
[03:08:59] (PS2) Dzahn: base/puppet: use "false" instead of "no" with "daemonize" option [puppet] - https://gerrit.wikimedia.org/r/359084 (https://phabricator.wikimedia.org/T166371)
[03:12:18] Operations, Operations-Software-Development, monitoring, Patch-For-Review: Monitoring: create an alert for daemonized puppet - https://phabricator.wikimedia.org/T166371#3350548 (Dzahn) uhmm.. i think we have to use "false" instead of "no". Well, the confusion is that the command line parameters a...
[03:13:19] (CR) Dzahn: "please see https://gerrit.wikimedia.org/r/#/c/359084/" [puppet] - https://gerrit.wikimedia.org/r/359084 (https://phabricator.wikimedia.org/T166371) (owner: Dzahn)
[03:14:01] (CR) Dzahn: "wrong paste. i meant, please see https://phabricator.wikimedia.org/T166371#3350548" [puppet] - https://gerrit.wikimedia.org/r/359084 (https://phabricator.wikimedia.org/T166371) (owner: Dzahn)
[03:27:04] (CR) Dzahn: [C: -1] "suggesting to do it here instead https://gerrit.wikimedia.org/r/#/c/359089/" [puppet] - https://gerrit.wikimedia.org/r/359029 (owner: Chad)
[03:29:22] RECOVERY - puppet last run on ms-be1011 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[03:29:36] (CR) Dzahn: [C: 1] "i'm pretty sure we want it to be boolean but still would be nice if somebody can confirm. because i _did_ test "no" and that also seemed t" [puppet] - https://gerrit.wikimedia.org/r/359084 (https://phabricator.wikimedia.org/T166371) (owner: Dzahn)
[03:31:38] (CR) Dzahn: [C: 1] "looks good to me but was running out of time to actually babysit it" [puppet] - https://gerrit.wikimedia.org/r/359025 (owner: Paladox)
[03:33:52] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[04:44:02] Operations, Commons, Multimedia, Traffic, and 2 others: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400#3350562 (Dispenser) The ongoing abuse has costs: in disk space 1 TB wasted so far, for administrators blocking dozens of accounts and d...
[04:59:09] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1036" [mediawiki-config] - https://gerrit.wikimedia.org/r/358966
[05:20:21] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1036" [mediawiki-config] - https://gerrit.wikimedia.org/r/358966 (owner: Marostegui)
[05:21:55] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1036" [mediawiki-config] - https://gerrit.wikimedia.org/r/358966 (owner: Marostegui)
[05:22:04] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1036" [mediawiki-config] - https://gerrit.wikimedia.org/r/358966 (owner: Marostegui)
[05:22:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1036 - T166205 (duration: 00m 44s)
[05:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:02] T166205: Convert unique keys into primary keys for some wiki tables on s2 - https://phabricator.wikimedia.org/T166205
[05:25:16] Operations, Commons, Multimedia, Traffic, and 2 others: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400#3350573 (Marostegui) p:Triage>Normal
[05:25:28] Operations, Commons, Multimedia, Traffic, and 2 others: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400#3350574 (zhuyifei1999) >>! In T167400#3350562, @Dispenser wrote: > Then (somehow) configure Varnish to understand WP0 IP ranges It [[h...
[05:33:02] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3350577 (Dzahn) https://meta.wikimedia.org/w/index.php?title=Requests_for_new_l...
[05:41:07] !log Deploy alter table s4 - dbstore1001 - T166206
[05:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:17] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206
[05:45:31] (PS1) Marostegui: db-eqiad.php: Add comment to db1018 status [mediawiki-config] - https://gerrit.wikimedia.org/r/359092 (https://phabricator.wikimedia.org/T166205)
[05:46:57] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3350586 (Dzahn) https://gerrit.wikimedia.org/r/#/c/358410/ -> 2017-06-12 20:3...
[05:47:22] (CR) Marostegui: [C: 2] db-eqiad.php: Add comment to db1018 status [mediawiki-config] - https://gerrit.wikimedia.org/r/359092 (https://phabricator.wikimedia.org/T166205) (owner: Marostegui)
[05:48:34] (Merged) jenkins-bot: db-eqiad.php: Add comment to db1018 status [mediawiki-config] - https://gerrit.wikimedia.org/r/359092 (https://phabricator.wikimedia.org/T166205) (owner: Marostegui)
[05:48:43] (CR) jenkins-bot: db-eqiad.php: Add comment to db1018 status [mediawiki-config] - https://gerrit.wikimedia.org/r/359092 (https://phabricator.wikimedia.org/T166205) (owner: Marostegui)
[05:49:53] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add comments to db1018 current status - T166205 (duration: 00m 43s)
[05:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:03] T166205: Convert unique keys into primary keys for some wiki tables on s2 - https://phabricator.wikimedia.org/T166205
[05:50:40] !log Deploy alter table s2 - db1018 - T166205
[05:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:12] PROBLEM - configured eth on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:00:02] RECOVERY - configured eth on ms-be1019 is OK: OK - interfaces up
[06:08:06] !log Deploy alter table s2 - labsdb1003 - T166205
[06:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:16] T166205: Convert unique keys into primary keys for some wiki tables on s2 - https://phabricator.wikimedia.org/T166205
[06:24:52] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 289 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[06:25:50] Operations, Traffic, Wikimedia-General-or-Unknown: Impending load test - https://phabricator.wikimedia.org/T167920#3350596 (Marostegui) p:Triage>Normal It would be helpful also if you could provide the start/end hour of your tests so we can identify those in our graphs.
[06:29:52] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 12 probes of 289 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[06:43:19] (CR) Hashar: "Ah probably yes. I guess I got confused with your comment about firejail messing up parameters:" [puppet] - https://gerrit.wikimedia.org/r/338979 (https://phabricator.wikimedia.org/T158649) (owner: Hashar)
[06:51:13] (CR) Hashar: [C: 1] "IIRC a User resource auto requires the Group resource. But it does not hurt to be explicit." [puppet] - https://gerrit.wikimedia.org/r/359012 (owner: Chad)
[06:55:01] Operations, Commons, Multimedia, Traffic, and 2 others: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400#3331800 (Bawolff) > Wikipedia Zero traffic is tied to IP addresses, not users. So it definitely could be performant. Have MediaWiki set...
[07:02:44] (CR) ArielGlenn: [C: 2] script to generate pagesperchunkhistory config setting for a given wiki [dumps] - https://gerrit.wikimedia.org/r/355075 (owner: ArielGlenn)
[07:04:28] (CR) ArielGlenn: [C: 2] make dumps using extension scripts work without MWScript stuff [dumps] - https://gerrit.wikimedia.org/r/355076 (owner: ArielGlenn)
[07:04:32] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:04:33] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:04:33] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:04:33] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:04:33] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:05:02] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:05:02] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:05:02] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:05:32] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:05:32] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[07:05:32] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:06:32] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:08:02] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:08:32] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:10:32] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[07:11:32] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy
[07:11:32] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
[07:12:32] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy
[07:12:32] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[07:13:32] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy
[07:14:32] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[07:14:42] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:15:42] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:16:02] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[07:16:42] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:16:42] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:17:02] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[07:17:42] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:18:42] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:18:42] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:19:04] (CR) Hashar: "That is promising first pass. The daemon should be running in foreground mode (-d) and we should let systemd handle the PID tracking for " (6 comments) [puppet] - https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) (owner: Paladox)
[07:22:12] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[07:22:32] (CR) Muehlenhoff: Zuul: Add systemd script for zuul (1 comment) [puppet] - https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) (owner: Paladox)
[07:23:32] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[07:23:32] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy
[07:23:32] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy
[07:23:32] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
[07:23:33] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy
[07:23:33] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[07:23:33] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[07:24:13] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:25:12] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[07:25:12] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:26:42] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:26:42] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:28:12] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:28:32] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[07:29:42] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:29:42] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:30:42] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:30:42] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:30:42] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:30:56] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3350632 (Amire80) Namespace translations are needed. You can see the list of na...
[07:31:12] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[07:31:12] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[07:31:42] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:32:32] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[07:32:32] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[07:32:32] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[07:32:33] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy
[07:32:42] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy
[07:32:42] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[07:32:42] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
[07:33:42] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy
[07:34:12] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:34:12] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:35:42] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:35:42] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:36:22] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:36:43] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:36:44] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:36:44] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:36:44] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:37:42] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:37:42] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:38:02] (PS8) Paladox: Zuul: Add systemd script for zuul [puppet] - https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833)
[07:38:06] (CR) Paladox: Zuul: Add systemd script for zuul (6 comments) [puppet] - https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) (owner: Paladox)
[07:38:32] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[07:38:42] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[07:41:42] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:43:42] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:48:29] (PS9) Paladox: Zuul: Add systemd script for zuul [puppet] - https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833)
[07:48:59] !log schedule 2 hours downtime for all citoid endpoints health on scb boxes
[07:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:31] (PS10) Paladox: Zuul: Add systemd script for zuul [puppet] - https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833)
[07:50:42] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[07:53:52] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad PMCID) timed out before a response was received
[07:54:17] Operations, ops-codfw, DC-Ops, Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3350658 (Gehel) Nothing more we can test at this point. It looks like elastic2020 is alive and well. Let's cross a few fingers to make sure i...
[07:54:42] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[07:56:22] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[07:56:59] !log akosiaris@tin Started deploy [citoid/deploy@ba0db9c]: Remove the bad PMCID test from spec
[07:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:37] !log akosiaris@tin Finished deploy [citoid/deploy@ba0db9c]: Remove the bad PMCID test from spec (duration: 00m 38s)
[07:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:43] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy
[07:59:52] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[08:01:42] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[08:01:42] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
[08:01:52] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy
[08:02:33] !log updating HHVM on terbium/wasat to 3.18
[08:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:22] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[08:04:52] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad PMCID) timed out before a response was received
[08:07:34] (PS8) Paladox: Ircecho: Fix bot not to use carriage returns [puppet] - https://gerrit.wikimedia.org/r/359025
[08:09:02] !log akosiaris@tin Started deploy [citoid/deploy@ba0db9c]: Remove the bad PMCID test from spec
[08:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:52] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[08:11:53] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad PMCID) timed out before a response was received
[08:13:52] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[08:15:02] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad PMCID) timed out before a response was received
[08:15:52] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[08:16:23] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[08:16:23] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[08:16:46] !log akosiaris@tin Finished deploy [citoid/deploy@ba0db9c]: Remove the bad PMCID test from spec (duration: 07m 44s)
[08:16:52] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[08:16:52] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy
[08:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:11] we should be ok now
[08:17:26] !citoid deploy finished. T133696
[08:17:26] T133696: NIH db misbehaviour causing problems to Citoid - https://phabricator.wikimedia.org/T133696
[08:19:18] akosiaris: did you miss the "!log" ?
[08:20:22] Operations, Citoid, Services, VisualEditor: NIH db misbehaviour causing problems to Citoid - https://phabricator.wikimedia.org/T133696#3350683 (akosiaris) This has caused from the N-th time alerts than can not be acted upon in #wikimedia-operations. As a result and in the interest of keeping the...
[08:20:25] :-)
[08:21:32] hehe :)
[08:22:16] (CR) Alexandros Kosiaris: [C: 1] base/puppet: use "false" instead of "no" with "daemonize" option [puppet] - https://gerrit.wikimedia.org/r/359084 (https://phabricator.wikimedia.org/T166371) (owner: Dzahn)
[08:22:41] (CR) Lucas Werkmeister (WMDE): "According to Special:Version, WikimediaMessages has been updated on Wikidata to include the merge commit of I661b9592d4 (though unfortunat" [mediawiki-config] - https://gerrit.wikimedia.org/r/358343 (https://phabricator.wikimedia.org/T167126) (owner: Lucas Werkmeister (WMDE))
[08:27:30] (PS4) Gehel: elasticsearch: remove UseConcMarkSweepGC [puppet] - https://gerrit.wikimedia.org/r/358383 (https://phabricator.wikimedia.org/T167636)
[08:28:58] (CR) Gehel: [C: 2] elasticsearch: remove UseConcMarkSweepGC [puppet] - https://gerrit.wikimedia.org/r/358383 (https://phabricator.wikimedia.org/T167636) (owner: Gehel)
[08:30:39] (PS3) Gehel: elasticsearch: use $facts['ipaddress'] as the published host [puppet] - https://gerrit.wikimedia.org/r/358353
[08:30:44] (CR) Alexandros Kosiaris: "LGTM, 1 nitpick inline." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/359089 (https://phabricator.wikimedia.org/T164030) (owner: Dzahn)
[08:32:17] (PS1) ArielGlenn: for misc dumps make logging config apply to root logger [dumps] - https://gerrit.wikimedia.org/r/359114
[08:32:32] (CR) Gehel: "After discussion with @dcausse, we decided to keep _site_ in bind_networks to ensure that all site addresses are exposed, including LVS ad" [puppet] - https://gerrit.wikimedia.org/r/358353 (owner: Gehel)
[08:32:42] (CR) Gehel: [C: 2] elasticsearch: use $facts['ipaddress'] as the published host [puppet] - https://gerrit.wikimedia.org/r/358353 (owner: Gehel)
[08:33:16] (PS2) ArielGlenn: for misc dumps make logging config apply to root logger [dumps] - https://gerrit.wikimedia.org/r/359114 (https://phabricator.wikimedia.org/T167940)
[08:34:37] (PS2) Gehel: elasticsearch - cleanup profile::elasticsearch [puppet] - https://gerrit.wikimedia.org/r/352126
[08:38:29] (PS1) Hashar: jenkins: lower console log spam [puppet] - https://gerrit.wikimedia.org/r/359116
[08:38:50] (CR) Gehel: [C: 2] elasticsearch - cleanup profile::elasticsearch [puppet] - https://gerrit.wikimedia.org/r/352126 (owner: Gehel)
[08:38:52] Operations, HHVM, Patch-For-Review, Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3350703 (MoritzMuehlenhoff) Open>Resolved All our jessie-based application servers (app servers, API servers, job runners and image scalers, script runners) are now upgra...
[08:39:26] (03CR) 10ArielGlenn: [C: 032] for misc dumps make logging config apply to root logger [dumps] - 10https://gerrit.wikimedia.org/r/359114 (https://phabricator.wikimedia.org/T167940) (owner: 10ArielGlenn) [08:40:17] (03CR) 10jerkins-bot: [V: 04-1] for misc dumps make logging config apply to root logger [dumps] - 10https://gerrit.wikimedia.org/r/359114 (https://phabricator.wikimedia.org/T167940) (owner: 10ArielGlenn) [08:40:32] !log restart relforge1001 to validate latest config changes [08:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:36] (03CR) 10ArielGlenn: [C: 032] "recheck" [dumps] - 10https://gerrit.wikimedia.org/r/359114 (https://phabricator.wikimedia.org/T167940) (owner: 10ArielGlenn) [08:43:15] (03PS31) 10Gehel: maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 (https://phabricator.wikimedia.org/T167871) [08:48:47] (03PS6) 10Gehel: Allow search clusters to reindex from eachother [puppet] - 10https://gerrit.wikimedia.org/r/344517 (owner: 10EBernhardson) [08:50:50] (03CR) 10Alexandros Kosiaris: [C: 032] interface: add rspec boilerplate [puppet] - 10https://gerrit.wikimedia.org/r/340420 (owner: 10Hashar) [08:50:55] (03PS5) 10Alexandros Kosiaris: interface: add rspec boilerplate [puppet] - 10https://gerrit.wikimedia.org/r/340420 (owner: 10Hashar) [08:50:58] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] interface: add rspec boilerplate [puppet] - 10https://gerrit.wikimedia.org/r/340420 (owner: 10Hashar) [08:50:59] bah I might have killed Jenkins somehow :(((( [08:51:26] (03CR) 10Alexandros Kosiaris: [C: 032] systemd: add spec [puppet] - 10https://gerrit.wikimedia.org/r/339176 (owner: 10Hashar) [08:51:36] (03PS4) 10Alexandros Kosiaris: systemd: add spec [puppet] - 10https://gerrit.wikimedia.org/r/339176 (owner: 10Hashar) [08:51:38] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] systemd: add spec [puppet] - 10https://gerrit.wikimedia.org/r/339176 (owner: 10Hashar) [08:52:42] !log 
ariel@tin Started deploy [dumps/dumps@1734c6d]: history dump rebalance script, fixup for extension script dumps, root logger for misc dumps [08:52:44] !log ariel@tin Finished deploy [dumps/dumps@1734c6d]: history dump rebalance script, fixup for extension script dumps, root logger for misc dumps (duration: 00m 02s) [08:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:04] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3350742 (10akosiaris) Fine by me. @jcrespo I think we can get back to this finally if you are ok with it. Things to do (in that order)... [09:00:41] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3350748 (10jcrespo) * Add dbmonitor1001, dbmonitor2001 to mysql ACLs so that tendril db can be contacted from it [09:04:50] !log Restarting Jenkins. It seems I managed to deadlock it [09:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:22] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3019.60 Read Requests/Sec=6267.00 Write Requests/Sec=62.10 KBytes Read/Sec=25093.20 KBytes_Written/Sec=3118.00 [09:05:46] !log reenable puppet on notebook1002, was disabled for the merge of the zookeeper role refactor two days ago, can be re-enabled now [09:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:39] java.lang.OutOfMemoryError can't be good could it ? [09:08:25] akosiaris: thank you for the reviews/merges etc. I have killed Jenkins though, so build are being reprocessed [09:10:13] !log Jenkins back up and happy. 
[09:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:04] hashar: :-) [09:11:25] akosiaris: so yesterday evening I was rebasing a bunch of old puppet patches that were pending [09:11:46] the rspec tests are probably terrible. Though that at least lets one manually check whether the module compiles [09:12:08] I use that often when working on puppet manifests, having inotify to watch files and trigger the rake spec :-} [09:13:09] yeah I had an epiphany yesterday that maybe we could ship an rspec for every single module that just tests that the module classes compile [09:14:27] !log shutting down and deleting data at pc1004 for cloning from db1096 [09:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:42] but that is probably gonna make running tests on puppet repo slower than it already is [09:15:06] so I am a bit reluctant for now [09:16:32] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=67.80 Read Requests/Sec=1.50 Write Requests/Sec=5.60 KBytes Read/Sec=28.80 KBytes_Written/Sec=142.40 [09:17:17] 10Operations, 10ops-eqiad, 10DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3350827 (10Marostegui) @Ottomata @elukey what do you guys want to do with this? [09:18:29] akosiaris: I just run them per module. And yeah I would love to find a way to run all the specs in parallel [09:20:24] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Impending load test - https://phabricator.wikimedia.org/T167920#3350830 (10Haiku-narrative) Run time will be Thursday 1800EST to Friday 0600EST, we're looking into updating our user agent now [09:21:07] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Impending load test - https://phabricator.wikimedia.org/T167920#3350832 (10Haiku-narrative) >>!
In T167920#3350164, @GWicke wrote: > You could consider using https://en.wikipedia.org/api/rest_v1/#!/Page_content/get_page_summary_title instead, which is... [09:33:51] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Impending load test - https://phabricator.wikimedia.org/T167920#3350859 (10Marostegui) >>! In T167920#3350830, @Haiku-narrative wrote: > Run time will be Thursday 1800EST to Friday 0600EST, > > we're looking into updating our user agent now For the r... [09:41:08] (03PS7) 10Gehel: Allow search clusters to reindex from eachother [puppet] - 10https://gerrit.wikimedia.org/r/344517 (owner: 10EBernhardson) [09:47:44] ACKNOWLEDGEMENT - Check systemd state on analytics1069 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. alexandros kosiaris testing newlines in ACKs ignore me please [09:50:33] (03PS8) 10Gehel: Allow search clusters to reindex from eachother [puppet] - 10https://gerrit.wikimedia.org/r/344517 (owner: 10EBernhardson) [09:55:05] (03PS9) 10Gehel: Allow search clusters to reindex from eachother [puppet] - 10https://gerrit.wikimedia.org/r/344517 (owner: 10EBernhardson) [09:58:23] (03PS1) 10Muehlenhoff: Add Icinga check for depletion of HHVM CLI cache [puppet] - 10https://gerrit.wikimedia.org/r/359120 (https://phabricator.wikimedia.org/T161598) [10:02:27] 10Operations, 10monitoring, 10HHVM, 10Patch-For-Review: Monitor HHVM bytecode cache depletion on mediawiki app servers - https://phabricator.wikimedia.org/T161598#3350897 (10MoritzMuehlenhoff) p:05Triage>03Normal [10:03:09] (03CR) 10DCausse: "looks good," (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344517 (owner: 10EBernhardson) [10:08:34] (03PS1) 10Phuedx: pagePreviews: Consume HTML from RESTBase endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359123 (https://phabricator.wikimedia.org/T165018) [10:16:54] !log rollout remaining systemd updates from jessie point release [10:17:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:19] 10Operations, 10Analytics, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3350950 (10ema) >>! In T118365#3349563, @Nuria wrote: > mmm... looking at pageview API dashboard I can see some of lawful traffic (spikes we could have handled) seems to have b... [10:38:14] hashar: how are we with self-merging config changes that affect the beta cluster only? https://gerrit.wikimedia.org/r/#/c/359123/ [10:39:35] phuedx: I just CR+2 them [10:39:47] phuedx: then head to the production deployment server and fetch/rebase [10:39:57] but I skip the actual deployment to prod cluster [10:43:08] (03PS10) 10Gehel: Allow search clusters to reindex from eachother [puppet] - 10https://gerrit.wikimedia.org/r/344517 (owner: 10EBernhardson) [10:43:23] PROBLEM - puppet last run on mw1281 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [10:48:13] 10Operations, 10Ops-Access-Requests, 10Analytics: analytics-privatedata-users access for ema - https://phabricator.wikimedia.org/T167952#3350991 (10ema) [10:48:21] 10Operations, 10Ops-Access-Requests, 10Analytics: analytics-privatedata-users access for ema - https://phabricator.wikimedia.org/T167952#3351003 (10ema) p:05Triage>03Normal [10:48:51] (03PS1) 10Ema: admin: add ema to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/359125 (https://phabricator.wikimedia.org/T167952) [10:49:51] (03CR) 10Volans: "LGTM, few minor and optional comments inline." 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/359120 (https://phabricator.wikimedia.org/T161598) (owner: 10Muehlenhoff) [10:55:33] RECOVERY - puppet last run on analytics1043 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [10:58:05] (03CR) 10Alexandros Kosiaris: [C: 032] Ircecho: Fix bot not to use carriage returns [puppet] - 10https://gerrit.wikimedia.org/r/359025 (owner: 10Paladox) [10:58:09] (03PS9) 10Alexandros Kosiaris: Ircecho: Fix bot not to use carriage returns [puppet] - 10https://gerrit.wikimedia.org/r/359025 (owner: 10Paladox) [10:58:21] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Ircecho: Fix bot not to use carriage returns [puppet] - 10https://gerrit.wikimedia.org/r/359025 (owner: 10Paladox) [11:10:55] Ignoreignore [11:11:05] Ignore\nignore [11:11:17] looks like it's working [11:11:34] RECOVERY - puppet last run on mw1281 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [11:16:25] (03PS1) 10Alexandros Kosiaris: Assign kubernetes::staging::worker role [puppet] - 10https://gerrit.wikimedia.org/r/359127 [11:18:33] PROBLEM - puppet last run on ms-be1026 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [11:23:47] (03CR) 10DCausse: [C: 031] Allow search clusters to reindex from eachother [puppet] - 10https://gerrit.wikimedia.org/r/344517 (owner: 10EBernhardson) [11:43:36] hashar: i'm going to +2 that change and fetch/rebase on deployment.eqiad [11:43:40] thanks [11:46:39] (03CR) 10Phuedx: [C: 032] "Following Hashar's advice, I'll merge this and then rebase the production deployment host as it's Beta Cluster only." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/359123 (https://phabricator.wikimedia.org/T165018) (owner: 10Phuedx) [11:46:53] RECOVERY - puppet last run on ms-be1026 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [11:47:54] (03Merged) 10jenkins-bot: pagePreviews: Consume HTML from RESTBase endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359123 (https://phabricator.wikimedia.org/T165018) (owner: 10Phuedx) [11:48:03] (03CR) 10jenkins-bot: pagePreviews: Consume HTML from RESTBase endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359123 (https://phabricator.wikimedia.org/T165018) (owner: 10Phuedx) [11:51:11] done [12:04:06] (03PS11) 10Gehel: Allow search clusters to reindex from eachother [puppet] - 10https://gerrit.wikimedia.org/r/344517 (owner: 10EBernhardson) [12:05:57] (03CR) 10Gehel: [C: 032] Allow search clusters to reindex from eachother [puppet] - 10https://gerrit.wikimedia.org/r/344517 (owner: 10EBernhardson) [12:14:05] !log restart elasticsearch on relforge1001 to validate latest config changes [12:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:32] 10Operations, 10Analytics, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3351181 (10BBlack) That top client appears to be CrossRefEventDataBot from https://www.crossref.org/services/event-data/ , running on a hosted server at Hetzner in DE. [12:35:27] 10Operations, 10Analytics, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3351184 (10ema) >>! In T118365#3351181, @BBlack wrote: > That top client appears to be CrossRefEventDataBot from https://www.crossref.org/services/event-data/ , running on a hos... 
[12:38:54] (03CR) 10Muehlenhoff: Add Icinga check for depletion of HHVM CLI cache (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/359120 (https://phabricator.wikimedia.org/T161598) (owner: 10Muehlenhoff) [12:39:16] (03PS2) 10Muehlenhoff: Add Icinga check for depletion of HHVM CLI cache [puppet] - 10https://gerrit.wikimedia.org/r/359120 (https://phabricator.wikimedia.org/T161598) [12:40:08] (03CR) 10jerkins-bot: [V: 04-1] Add Icinga check for depletion of HHVM CLI cache [puppet] - 10https://gerrit.wikimedia.org/r/359120 (https://phabricator.wikimedia.org/T161598) (owner: 10Muehlenhoff) [12:42:03] (03PS3) 10Muehlenhoff: Add Icinga check for depletion of HHVM CLI cache [puppet] - 10https://gerrit.wikimedia.org/r/359120 (https://phabricator.wikimedia.org/T161598) [12:47:26] (03PS3) 10Faidon Liambotis: base/puppet: use "false" instead of "no" with "daemonize" option [puppet] - 10https://gerrit.wikimedia.org/r/359084 (https://phabricator.wikimedia.org/T166371) (owner: 10Dzahn) [12:47:37] (03CR) 10Faidon Liambotis: [V: 032 C: 032] base/puppet: use "false" instead of "no" with "daemonize" option [puppet] - 10https://gerrit.wikimedia.org/r/359084 (https://phabricator.wikimedia.org/T166371) (owner: 10Dzahn) [12:48:10] (03PS2) 10Ema: VCL: add Retry-After header to 429 responses [puppet] - 10https://gerrit.wikimedia.org/r/358965 (https://phabricator.wikimedia.org/T163233) [12:51:45] (03PS3) 10Ema: VCL: add Retry-After header to 429 responses [puppet] - 10https://gerrit.wikimedia.org/r/358965 (https://phabricator.wikimedia.org/T163233) [12:52:58] (03CR) 10Ema: [C: 032] VCL: add Retry-After header to 429 responses [puppet] - 10https://gerrit.wikimedia.org/r/358965 (https://phabricator.wikimedia.org/T163233) (owner: 10Ema) [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. 
Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170615T1300). [13:00:04] dcausse: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:14] o/ [13:00:15] * aude waves [13:00:21] * hashar hides [13:00:22] suppose i could do swat today [13:00:54] if at all possible please self serve! ;-:} [13:01:07] I am dealing with some paper work, but I am around for assistance as needed [13:01:12] ok [13:02:14] !log starting elasticsearch upgrade to 5.3.2 on relforge cluster - T163708 [13:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:23] T163708: Upgrade the production search cluster to elastic 5.3.2 - https://phabricator.wikimedia.org/T163708 [13:02:34] (03CR) 10Aude: [C: 032] Enable BM25 for Chinese wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350312 (https://phabricator.wikimedia.org/T163829) (owner: 10Tjones) [13:08:13] PROBLEM - Check systemd state on mw1295 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:08:41] (03PS6) 10Aude: Enable BM25 for Chinese wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350312 (https://phabricator.wikimedia.org/T163829) (owner: 10Tjones) [13:08:50] (03CR) 10Aude: [C: 032] Enable BM25 for Chinese wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350312 (https://phabricator.wikimedia.org/T163829) (owner: 10Tjones) [13:09:03] RECOVERY - Check systemd state on mw1295 is OK: OK - running: The system is fully operational [13:09:44] 10Operations, 10Release-Engineering-Team, 10Wikimedia-log-errors: Cronjobs attempting to connect to labstestweb2001 - https://phabricator.wikimedia.org/T167961#3351232 (10Marostegui) [13:10:10] (03Merged) 10jenkins-bot: Enable BM25 for Chinese wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350312 (https://phabricator.wikimedia.org/T163829) (owner: 10Tjones) [13:10:19] (03CR) 10jenkins-bot: Enable BM25 for Chinese wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350312 (https://phabricator.wikimedia.org/T163829) (owner: 10Tjones) [13:10:23] PROBLEM - HHVM rendering on mw1295 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [13:10:24] PROBLEM - Nginx local proxy to apache on mw1295 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.010 second response time [13:10:38] 10Operations, 10Release-Engineering-Team, 10Wikimedia-log-errors: Cronjobs attempting to connect to labstestweb2001 - https://phabricator.wikimedia.org/T167961#3351245 (10Marostegui) p:05Triage>03Normal [13:11:37] 10Operations, 10Release-Engineering-Team, 10cloud-services-team, 10Wikimedia-log-errors: Cronjobs attempting to connect to labstestweb2001 - https://phabricator.wikimedia.org/T167961#3351232 (10Marostegui) [13:12:07] (03PS2) 10Alexandros Kosiaris: Assign kubernetes::staging::worker role [puppet] - 10https://gerrit.wikimedia.org/r/359127 [13:12:23] RECOVERY - Nginx local proxy to apache on mw1295 is OK: HTTP OK: 
HTTP/1.1 301 Moved Permanently - 613 bytes in 0.063 second response time [13:12:48] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Assign kubernetes::staging::worker role [puppet] - 10https://gerrit.wikimedia.org/r/359127 (owner: 10Alexandros Kosiaris) [13:13:23] RECOVERY - HHVM rendering on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 73641 bytes in 0.142 second response time [13:13:26] !log aude@tin Synchronized tests/cirrusTest.php: (no justification provided) (duration: 00m 45s) [13:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:43] !log aude@tin Synchronized wmf-config/InitialiseSettings.php: Enable BM25 for Chinese wikis (duration: 00m 44s) [13:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:08] dcausse: ^ [13:15:14] aude: looking [13:15:38] (03CR) 10Aude: [C: 032] Add “Constraints” section for constraint statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358343 (https://phabricator.wikimedia.org/T167126) (owner: 10Lucas Werkmeister (WMDE)) [13:15:41] aude: patch seems effective, I'll reindex the wikis now [13:15:47] thanks [13:15:50] thanks! [13:16:37] (03Merged) 10jenkins-bot: Add “Constraints” section for constraint statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358343 (https://phabricator.wikimedia.org/T167126) (owner: 10Lucas Werkmeister (WMDE)) [13:19:03] !log [cirrus] reindexing all zh wikis (eqiad & codfw) [13:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:03] PROBLEM - salt-minion processes on kubestage1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:05] PROBLEM - Check systemd state on kubestage1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:13] PROBLEM - dhclient process on kubestage1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:14] PROBLEM - configured eth on kubestage1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:21:23] PROBLEM - DPKG on kubestage1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:23] PROBLEM - puppet last run on kubestage1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:23] PROBLEM - MD RAID on kubestage1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:43] PROBLEM - Disk space on kubestage1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:22:13] PROBLEM - Restbase root url on restbase2001 is CRITICAL: connect to address 10.192.16.152 and port 7231: Connection refused [13:23:02] !log aude@tin Synchronized wmf-config/Wikibase.php: Add constraints statements section on Wikidata T167126 (duration: 00m 43s) [13:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:11] T167126: Create separate section for constraint statements on properties - https://phabricator.wikimedia.org/T167126 [13:24:13] RECOVERY - Restbase root url on restbase2001 is OK: HTTP OK: HTTP/1.1 200 - 953 bytes in 0.077 second response time [13:24:33] think we are done [13:24:53] !log elasticsearch upgrade to 5.3.2 on relforge cluster completed, cluster still recovering - T163708 [13:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:02] T163708: Upgrade the production search cluster to elastic 5.3.2 - https://phabricator.wikimedia.org/T163708 [13:25:24] (03PS1) 10Alexandros Kosiaris: kubestage: Set the correct partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/359133 (https://phabricator.wikimedia.org/T166264) [13:25:51] 10Operations, 10Release-Engineering-Team, 10cloud-services-team, 10Wikimedia-log-errors: Cronjobs attempting to connect to labstestweb2001 - https://phabricator.wikimedia.org/T167961#3351232 (10Andrew) In general we're trying to make wikitech (and labtestwikitech) more like normal wikis... they're currentl... 
[13:26:03] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] kubestage: Set the correct partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/359133 (https://phabricator.wikimedia.org/T166264) (owner: 10Alexandros Kosiaris) [13:26:53] RECOVERY - Check systemd state on analytics1069 is OK: OK - running: The system is fully operational [13:28:13] PROBLEM - Restbase root url on restbase2001 is CRITICAL: connect to address 10.192.16.152 and port 7231: Connection refused [13:30:13] RECOVERY - Restbase root url on restbase2001 is OK: HTTP OK: HTTP/1.1 200 - 973 bytes in 0.080 second response time [13:32:54] (03CR) 10jenkins-bot: Add “Constraints” section for constraint statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358343 (https://phabricator.wikimedia.org/T167126) (owner: 10Lucas Werkmeister (WMDE)) [13:34:13] PROBLEM - Restbase root url on restbase2001 is CRITICAL: connect to address 10.192.16.152 and port 7231: Connection refused [13:35:31] akosiaris, mobrovac: ^ [13:35:43] known known [13:35:47] ok :) [13:35:57] we are struggling with the new cassandra driver [13:36:19] good luck! [13:36:26] heh indeed [13:36:28] mobrovac: try uber :-P [13:36:28] thnx ema! 
[13:36:33] lol volans [13:36:54] uber doesn't work here, easy taxi does though :P [13:37:14] 10Operations: Look into feasibility of disabling sha-1 host keys on our ssh daemons - https://phabricator.wikimedia.org/T167966#3351354 (10BBlack) [13:38:13] RECOVERY - Restbase root url on restbase2001 is OK: HTTP OK: HTTP/1.1 200 - 973 bytes in 0.075 second response time [13:39:34] (03PS4) 10Ema: Redirect /api/rest_v1 to RESTBase docs page [puppet] - 10https://gerrit.wikimedia.org/r/306979 (https://phabricator.wikimedia.org/T125226) (owner: 10Ppchelko) [13:40:04] (03PS1) 10Lucas Werkmeister (WMDE): Add “Constraints” section on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359135 (https://phabricator.wikimedia.org/T167126) [13:41:13] PROBLEM - Restbase root url on restbase2001 is CRITICAL: connect to address 10.192.16.152 and port 7231: Connection refused [13:42:33] (03CR) 10Hashar: "Almost good :-} I guess you tried it out on labs already?" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) (owner: 10Paladox) [13:43:23] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10Zuul: Migrate zuul-server behind systemd service - https://phabricator.wikimedia.org/T167845#3351369 (10hashar) a:03Paladox @Paladox is kindly dealing with it \O/ [13:45:19] 10Operations, 10ops-eqiad, 10DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3351373 (10Ottomata) @elukey might have other opinions, but I'm inclined to try our best to expedite the ordering of new hardware, rather than worry about the BBU. I... 
[13:49:15] RECOVERY - Restbase root url on restbase2001 is OK: HTTP OK: HTTP/1.1 200 - 973 bytes in 0.074 second response time [13:52:14] PROBLEM - Restbase root url on restbase2001 is CRITICAL: connect to address 10.192.16.152 and port 7231: Connection refused [13:54:14] (03CR) 10Aude: [C: 031] Add “Constraints” section on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359135 (https://phabricator.wikimedia.org/T167126) (owner: 10Lucas Werkmeister (WMDE)) [13:56:09] (03CR) 10Alexandros Kosiaris: [C: 04-1] "minor minor nitpick, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357717 (owner: 10Faidon Liambotis) [13:58:09] (03PS4) 10Faidon Liambotis: wmflib: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357717 [13:59:17] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3351402 (10Benoit_Rochon) Hello Amir. Ok, we are currently working on it. Should... 
[14:02:20] (03PS8) 10Gehel: Add Shiny Server module and Discovery Dashboards role/profile [puppet] - 10https://gerrit.wikimedia.org/r/353571 (https://phabricator.wikimedia.org/T161354) (owner: 10Bearloga) [14:05:05] (03PS1) 10Ema: VCL: same rate-limit for api.php and restbase [puppet] - 10https://gerrit.wikimedia.org/r/359137 (https://phabricator.wikimedia.org/T163233) [14:05:22] (03PS9) 10Gehel: Add Shiny Server module and Discovery Dashboards role/profile [puppet] - 10https://gerrit.wikimedia.org/r/353571 (https://phabricator.wikimedia.org/T161354) (owner: 10Bearloga) [14:06:41] (03CR) 10Gehel: [C: 032] Add Shiny Server module and Discovery Dashboards role/profile [puppet] - 10https://gerrit.wikimedia.org/r/353571 (https://phabricator.wikimedia.org/T161354) (owner: 10Bearloga) [14:10:29] (03PS3) 10Krinkle: Remove EtcdConfig from beta cluster for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351767 (owner: 10Tim Starling) [14:10:32] (03CR) 10Krinkle: [C: 032] Remove EtcdConfig from beta cluster for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351767 (owner: 10Tim Starling) [14:11:24] (03CR) 10Alexandros Kosiaris: [C: 032] wmflib: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357717 (owner: 10Faidon Liambotis) [14:11:29] (03PS5) 10Alexandros Kosiaris: wmflib: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357717 (owner: 10Faidon Liambotis) [14:11:31] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] wmflib: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357717 (owner: 10Faidon Liambotis) [14:11:37] Rolling out the above beta-only change in beta. Once okay there, will pull on tin, verify no-op on mwdebug and scap-file. 
[14:11:46] (03Merged) 10jenkins-bot: Remove EtcdConfig from beta cluster for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351767 (owner: 10Tim Starling) [14:11:59] (03CR) 10jenkins-bot: Remove EtcdConfig from beta cluster for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351767 (owner: 10Tim Starling) [14:12:47] (03PS11) 10Paladox: Zuul: Add systemd script for zuul [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) [14:12:51] (03CR) 10Paladox: Zuul: Add systemd script for zuul (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) (owner: 10Paladox) [14:13:37] (03CR) 10Paladox: ">" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) (owner: 10Paladox) [14:15:48] (03PS2) 10Ema: VCL: same rate-limit for api.php and restbase [puppet] - 10https://gerrit.wikimedia.org/r/359137 (https://phabricator.wikimedia.org/T163233) [14:16:31] 10Operations, 10cloud-services-team, 10Wikimedia-log-errors: Cronjobs attempting to connect to labstestweb2001 - https://phabricator.wikimedia.org/T167961#3351416 (10jcrespo) ``` does labstestweb2001 have echo enabled? Looks like yes, or at least it's installed there. ok, so th... [14:16:46] 10Operations, 10DBA, 10cloud-services-team, 10Wikimedia-log-errors: Cronjobs attempting to connect to labstestweb2001 - https://phabricator.wikimedia.org/T167961#3351418 (10jcrespo) [14:22:57] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10Zuul: Migrate zuul-server behind systemd service - https://phabricator.wikimedia.org/T167845#3351446 (10Paladox) Thanks :) [14:23:51] 10Operations, 10DBA, 10cloud-services-team, 10Wikimedia-log-errors: Cronjobs attempting to connect to labstestweb2001 - https://phabricator.wikimedia.org/T167961#3351450 (10Marostegui) Should the missing user be created then? 
[14:24:58] 10Operations, 10DBA, 10cloud-services-team, 10Wikimedia-log-errors: Cronjobs attempting to connect to labstestweb2001 - https://phabricator.wikimedia.org/T167961#3351455 (10jcrespo) Is there a file where grants for this are tracked? Should it be shared between labswiki and labstestwiki? [14:34:46] (03CR) 10Alexandros Kosiaris: [C: 032] "Tested the servermon reporter live, indeed the self are not required, merging" [puppet] - 10https://gerrit.wikimedia.org/r/357718 (owner: 10Faidon Liambotis) [14:34:51] (03PS4) 10Alexandros Kosiaris: puppetmaster: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357718 (owner: 10Faidon Liambotis) [14:35:30] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] puppetmaster: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357718 (owner: 10Faidon Liambotis) [14:38:40] !log krinkle@tin Synchronized wmf-config/CommonSettings.php: no-op Ifc7b1ea80 - Remove EtcdConfig from beta (duration: 00m 45s) [14:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:42] !log killing stuck replication on maps1001 [14:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:11] (03CR) 10Ottomata: [C: 032] admin: add ema to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/359125 (https://phabricator.wikimedia.org/T167952) (owner: 10Ema) [14:41:17] (03PS2) 10Ottomata: admin: add ema to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/359125 (https://phabricator.wikimedia.org/T167952) (owner: 10Ema) [14:41:20] (03CR) 10Ottomata: [V: 032 C: 032] admin: add ema to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/359125 (https://phabricator.wikimedia.org/T167952) (owner: 10Ema) [14:41:33] (03CR) 10Alexandros Kosiaris: [C: 032] hiera_lookup: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357719 (owner: 10Faidon Liambotis) [14:41:37] (03PS5) 10Alexandros Kosiaris: 
hiera_lookup: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357719 (owner: 10Faidon Liambotis) [14:41:39] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] hiera_lookup: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357719 (owner: 10Faidon Liambotis) [14:43:53] 10Operations, 10DBA, 10cloud-services-team, 10Wikimedia-log-errors: Cronjobs attempting to connect to labstestweb2001 - https://phabricator.wikimedia.org/T167961#3351529 (10Marostegui) The only file where I could see it was on `wikitech.sql.erb` and if we add it to labstest I would suggest we add it there... [14:45:52] 10Operations, 10HyperSwitch, 10RESTBase-API, 10Traffic, 10Services (next): Respect host header in RESTBase, and redirect v1 to v1/ - https://phabricator.wikimedia.org/T167972#3351561 (10GWicke) [14:46:12] 10Operations, 10HyperSwitch, 10RESTBase-API, 10Traffic, 10Services (next): Respect host header in RESTBase, and redirect /rest_v1 to /rest_v1/ - https://phabricator.wikimedia.org/T167972#3351574 (10GWicke) [14:46:58] (03CR) 10Thiemo Mättig (WMDE): [C: 031] "Oh, these settings are a mess. :-(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359135 (https://phabricator.wikimedia.org/T167126) (owner: 10Lucas Werkmeister (WMDE)) [14:48:43] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2070 - https://phabricator.wikimedia.org/T167667#3351589 (10Papaul) a:05Papaul>03Marostegui Disk replacement complete. [14:50:37] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2070 - https://phabricator.wikimedia.org/T167667#3351593 (10Marostegui) Thanks! Will close the ticket once the rebuilt is done: ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 2% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600... [14:51:57] 10Operations, 10Discovery, 10Maps, 10Interactive-Sprint: import_waterlines is broken - https://phabricator.wikimedia.org/T159771#3351615 (10Gehel) This has actually been fixed as part of T159631. 
I just checked, import_waterlines completed successfully on June 1st. [14:52:16] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, 10User-fgiunchedi: Decommission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785#3351619 (10Papaul) Disk wipe in progress [14:54:14] (03PS1) 10Marostegui: wikitech.sql.erb: Add labtestwiki database [puppet] - 10https://gerrit.wikimedia.org/r/359149 (https://phabricator.wikimedia.org/T167961) [14:54:27] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review, 10Wikimedia-log-errors: Cronjobs attempting to connect to labstestweb2001 - https://phabricator.wikimedia.org/T167961#3351637 (10jcrespo) maybe we should use the core grants file? or does wikitech have other different grants from core? if ye... [14:56:22] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review, 10Wikimedia-log-errors: Cronjobs attempting to connect to labstestweb2001 - https://phabricator.wikimedia.org/T167961#3351639 (10Marostegui) >>! In T167961#3351637, @jcrespo wrote: > maybe we should use the core grants file? or does wikitech... [14:57:15] (03CR) 10Jcrespo: "I am unsure about this. Maybe we should embody core.sql? I think that uses %wiki% for grants. would that be possible? (see comment on tick" [puppet] - 10https://gerrit.wikimedia.org/r/359149 (https://phabricator.wikimedia.org/T167961) (owner: 10Marostegui) [15:01:59] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review, 10Wikimedia-log-errors: Cronjobs attempting to connect to labstestweb2001 - https://phabricator.wikimedia.org/T167961#3351653 (10jcrespo) We do not have to wait, is it easy to see if everthing on core covers wikitech? What about the rest- I... [15:02:29] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Patch-For-Review: analytics-privatedata-users access for ema - https://phabricator.wikimedia.org/T167952#3351654 (10ema) 05Open>03Resolved a:03ema Done! 
[15:02:54] RECOVERY - salt-minion processes on kubestage1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:02:54] RECOVERY - dhclient process on kubestage1001 is OK: PROCS OK: 0 processes with command name dhclient [15:02:54] RECOVERY - configured eth on kubestage1001 is OK: OK - interfaces up [15:03:06] RECOVERY - DPKG on kubestage1001 is OK: All packages OK [15:03:06] RECOVERY - MD RAID on kubestage1001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [15:03:06] RECOVERY - Disk space on kubestage1001 is OK: DISK OK [15:04:26] (03PS3) 10Ema: VCL: same rate-limit for api.php and restbase [puppet] - 10https://gerrit.wikimedia.org/r/359137 (https://phabricator.wikimedia.org/T163233) [15:04:33] (03CR) 10Ema: [V: 032 C: 032] VCL: same rate-limit for api.php and restbase [puppet] - 10https://gerrit.wikimedia.org/r/359137 (https://phabricator.wikimedia.org/T163233) (owner: 10Ema) [15:07:32] (03PS12) 10Paladox: Zuul: Add systemd script for zuul [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) [15:08:00] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review, 10Wikimedia-log-errors: Cronjobs attempting to connect to labstestweb2001 - https://phabricator.wikimedia.org/T167961#3351662 (10Marostegui) >>! In T167961#3351653, @jcrespo wrote: > We do not have to wait, is it easy to see if everthing on... [15:08:42] (03Abandoned) 10Marostegui: wikitech.sql.erb: Add labtestwiki database [puppet] - 10https://gerrit.wikimedia.org/r/359149 (https://phabricator.wikimedia.org/T167961) (owner: 10Marostegui) [15:10:00] (03CR) 10Jcrespo: "I said I wasn't sure, not that we shouldn't do it! We need feedback from cloud, but this should be reverted." [puppet] - 10https://gerrit.wikimedia.org/r/359149 (https://phabricator.wikimedia.org/T167961) (owner: 10Marostegui) [15:10:34] (03CR) 10Paladox: "Tested and works now." 
[puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) (owner: 10Paladox) [15:11:33] (03CR) 10Jcrespo: "s/reverted/Restored back to discussion/" [puppet] - 10https://gerrit.wikimedia.org/r/359149 (https://phabricator.wikimedia.org/T167961) (owner: 10Marostegui) [15:13:13] (03CR) 10Marostegui: "Not sure, this would be covered by core grants regex on %wik%. The only thing we'd need is to decide what to do with the oahtreader before" [puppet] - 10https://gerrit.wikimedia.org/r/359149 (https://phabricator.wikimedia.org/T167961) (owner: 10Marostegui) [15:15:20] (03Restored) 10Marostegui: wikitech.sql.erb: Add labtestwiki database [puppet] - 10https://gerrit.wikimedia.org/r/359149 (https://phabricator.wikimedia.org/T167961) (owner: 10Marostegui) [15:15:38] PROBLEM - Host mc2036 is DOWN: PING CRITICAL - Packet loss = 100% [15:16:47] 10Operations: terbium maintenance cron "processEchoEmailBatch.php" is getting "access denied" from database - https://phabricator.wikimedia.org/T167373#3351688 (10jcrespo) [15:16:50] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review, 10Wikimedia-log-errors: Cronjobs attempting to connect to labstestweb2001 - https://phabricator.wikimedia.org/T167961#3351686 (10jcrespo) [15:16:58] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:18:01] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review, 10Wikimedia-log-errors: Cronjobs attempting to connect to labstestweb2001 - https://phabricator.wikimedia.org/T167961#3351232 (10jcrespo) [15:18:03] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#3351691 (10jcrespo) [15:19:09] 10Operations, 10MediaWiki-extensions-PageAssessments, 10Wikimedia-General-or-Unknown, 10Patch-For-Review: foreachwikiindblist regular cronspam - https://phabricator.wikimedia.org/T159438#3351694 (10jcrespo) Is this still happening? [15:19:24] 10Operations, 10ops-codfw: mc2036 link down - https://phabricator.wikimedia.org/T167975#3351695 (10Marostegui) [15:19:34] 10Operations, 10ops-codfw: mc2036 link down - https://phabricator.wikimedia.org/T167975#3351708 (10Marostegui) p:05Triage>03Normal [15:19:42] 10Operations, 10ops-codfw: mc2036 eth0 link down - https://phabricator.wikimedia.org/T167975#3351695 (10Marostegui) [15:19:49] 10Operations, 10MediaWiki-extensions-PageAssessments: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#3351716 (10jcrespo) [15:19:54] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review, 10Wikimedia-log-errors: Cronjobs attempting to connect to labstestweb2001 - https://phabricator.wikimedia.org/T167961#3351714 (10jcrespo) [15:21:18] PROBLEM - IPsec on mc1036 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2036_v4 [15:21:58] RECOVERY - puppet last run on kubestage1001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [15:30:24] (03CR) 10Jcrespo: "I think you were right, sorry: T167973#3351710" [puppet] - 10https://gerrit.wikimedia.org/r/359149 (https://phabricator.wikimedia.org/T167961) (owner: 10Marostegui) [15:31:08] RECOVERY - Host mc2036 is UP: PING OK - Packet loss = 0%, RTA = 36.31 ms [15:31:18] RECOVERY - IPsec on mc1036 is OK: Strongswan OK - 1 ESP OK 
[15:32:23] 10Operations, 10Discovery, 10Maps, 10Interactive-Sprint: import_waterlines is broken - https://phabricator.wikimedia.org/T159771#3351736 (10debt) 05Open>03Resolved Yay, closing! [15:32:48] (03Abandoned) 10Marostegui: wikitech.sql.erb: Add labtestwiki database [puppet] - 10https://gerrit.wikimedia.org/r/359149 (https://phabricator.wikimedia.org/T167961) (owner: 10Marostegui) [15:33:44] 10Operations, 10ops-codfw: mc2036 eth0 link down - https://phabricator.wikimedia.org/T167975#3351740 (10Marostegui) 05Open>03Resolved Back! ``` root@mc2036:~# mii-tool eth0 eth0: negotiated, link ok ``` [15:41:08] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [15:44:42] 10Operations, 10Analytics, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3351748 (10Nuria) Thanks for the prompt response, when the number of changes I did not see when these took effect, it is true that we do not see on our end 429s at all times, bu... [15:46:01] 10Operations, 10Analytics, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3351751 (10Nuria) If you look at 404s however, looks like the throttling had a positive effect on removing "garbaage-y" traffic. 
[15:50:49] 10Operations, 10Analytics-Kanban, 10User-Elukey: New analytic hosts with BBU learning cycle enabled - https://phabricator.wikimedia.org/T167809#3351771 (10Nuria) [15:51:00] 10Operations, 10Analytics-Kanban, 10User-Elukey: New analytic hosts with BBU learning cycle enabled - https://phabricator.wikimedia.org/T167809#3345083 (10Nuria) Puting on kanban for @elukey to look at [15:52:44] (03PS1) 10Marostegui: mariadb: wikitech servers to use core grants [puppet] - 10https://gerrit.wikimedia.org/r/359152 (https://phabricator.wikimedia.org/T167961) [15:54:38] 10Operations, 10Goal, 10Kubernetes, 10Patch-For-Review: Design and implement a Kubernetes-based staging environment. (stretch) - https://phabricator.wikimedia.org/T162045#3351781 (10akosiaris) [15:55:15] 10Operations, 10Goal, 10Kubernetes: Prepare to service applications from kubernetes - https://phabricator.wikimedia.org/T162039#3351785 (10akosiaris) [15:55:17] 10Operations, 10Goal, 10Kubernetes, 10Patch-For-Review: Design and implement a Kubernetes-based staging environment. (stretch) - https://phabricator.wikimedia.org/T162045#3150713 (10akosiaris) 05Open>03Resolved a:03akosiaris kubestage1001, kubestage1002 are up and running, calico is working fine, BGP... [15:56:29] (03CR) 10Jcrespo: "Wait, if we redirect we should just delete grants/wikitech ? or is it used elsewhere?" [puppet] - 10https://gerrit.wikimedia.org/r/359152 (https://phabricator.wikimedia.org/T167961) (owner: 10Marostegui) [15:58:07] (03CR) 10Jcrespo: "I just saw the comment, nah, just delete, we have it on git if we need to look back. We may need to delete the private key. 
And add Andrew" [puppet] - 10https://gerrit.wikimedia.org/r/359152 (https://phabricator.wikimedia.org/T167961) (owner: 10Marostegui) [16:00:01] (03PS2) 10Marostegui: mariadb: wikitech servers to use core grants [puppet] - 10https://gerrit.wikimedia.org/r/359152 (https://phabricator.wikimedia.org/T167961) [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170615T1600). Please do the needful. [16:00:04] twentyafterfour: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:01:34] (03CR) 10Jcrespo: [C: 031] "Let's wait for cloud team to review!" [puppet] - 10https://gerrit.wikimedia.org/r/359152 (https://phabricator.wikimedia.org/T167961) (owner: 10Marostegui) [16:03:16] (03CR) 10Marostegui: "Puppet compiler looks good: https://puppet-compiler.wmflabs.org/6785/" [puppet] - 10https://gerrit.wikimedia.org/r/359152 (https://phabricator.wikimedia.org/T167961) (owner: 10Marostegui) [16:03:38] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10netops: codfw: labtestpuppetmaster2001 switch port configuration - https://phabricator.wikimedia.org/T167321#3324750 (10ayounsi) The switch was showing "Carrier transitions" errors on that interface and no inbound traffic. We tried changing the... 
[16:09:09] (03PS4) 10Andrew Bogott: labspuppetbackend: add api methods to query by role [puppet] - 10https://gerrit.wikimedia.org/r/359041 (https://phabricator.wikimedia.org/T151522) [16:10:41] (03CR) 10Andrew Bogott: [C: 032] labspuppetbackend: add api methods to query by role [puppet] - 10https://gerrit.wikimedia.org/r/359041 (https://phabricator.wikimedia.org/T151522) (owner: 10Andrew Bogott) [16:17:26] (03PS2) 10Framawiki: Create a FeaturedFeed for the RAW bulletin on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358772 (https://phabricator.wikimedia.org/T167617) [16:21:25] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Idh0854 → Garam: supervision needed - https://phabricator.wikimedia.org/T167031#3351989 (10MarcoAurelio) 05Open>03stalled p:05Normal>03Low Unfortunately busy days are ahead and I'll not be able to handle this request. My plan is to deci... [16:24:43] (03PS4) 10Krinkle: [WIP] mediawiki: Fix error page template issues [puppet] - 10https://gerrit.wikimedia.org/r/358430 (https://phabricator.wikimedia.org/T113114) [16:33:21] (03PS5) 10Krinkle: mediawiki: Fix error page template issues [puppet] - 10https://gerrit.wikimedia.org/r/358430 (https://phabricator.wikimedia.org/T113114) [16:33:45] andrewbogott: Could you try https://gerrit.wikimedia.org/r/#/c/358430/5/modules/mediawiki/templates/errorpage.html.erb at some point? [16:33:55] Should fix the dynamicproxy errorpage issues [16:34:22] sure. 
I think I had that mentally filed as still a wip [16:34:40] (03PS12) 10Krinkle: varnish: Avoid std.fileread() and use new errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) [16:34:45] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: Fix error page template issues [puppet] - 10https://gerrit.wikimedia.org/r/358430 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [16:36:37] (03CR) 10jerkins-bot: [V: 04-1] varnish: Avoid std.fileread() and use new errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [16:37:38] (03PS6) 10Krinkle: mediawiki: Fix error page template issues [puppet] - 10https://gerrit.wikimedia.org/r/358430 (https://phabricator.wikimedia.org/T113114) [16:37:50] (03PS13) 10Krinkle: varnish: Avoid std.fileread() and use new errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) [16:38:13] andrewbogott: Yeah, it was. You merged it, then reverted, fixed up and re-merged. [16:38:27] But I still noticed 2 minor issues with the result. Not hard-failures, but mistakes on my side [16:38:47] Some bug I ran into with Puppet 3 apparently, where undef doesn't always become nil in ERB and thus isn't falsey in the if-check [16:39:01] Which is why https://tools.wmflabs.org/.error/banned.html says "undef" [16:39:06] instead of nothing [16:41:53] (03CR) 10Krinkle: "Re-applied on Beta and still passes on puppet compiler: https://puppet-compiler.wmflabs.org/6741/" [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [16:42:11] I can merge it now if you have time to test &c. I can't give it my actual attention though. [16:42:17] Want me to go ahead and merge? [16:42:20] andrewbogott: Sure. 
[16:42:35] (03CR) 10Andrew Bogott: [C: 032] mediawiki: Fix error page template issues [puppet] - 10https://gerrit.wikimedia.org/r/358430 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [16:42:37] It's only used by dynamicproxy still, so low impact [16:44:44] (03PS1) 10EBernhardson: Enable token_count_router for cirrus queries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359183 (https://phabricator.wikimedia.org/T152094) [16:47:19] hm… of course I didn't test my canary beforehand so now I don't know if this broke it or if it was always broken [16:47:22] Krinkle, what do you see? [16:47:47] andrewbogott: It's not applied yet afaik [16:48:05] https://tools.wmflabs.org/.error/errorpage.html?_=1 https://tools.wmflabs.org/.error/banned.html?_=2 [16:48:27] andrewbogott: Do you see a diff in the puppet run? [16:48:30] so we're talking about the tools proxy now, not the labs proxy? [16:48:54] I mean, do you have a test case for the labs proxy? [16:49:05] andrewbogott: Eh either is fine I suppose. I don't know where the labs proxy is exposed [16:51:00] ok, it should be applied now [16:51:04] on tools [16:52:56] andrewbogott: I still don't see it applied [16:53:01] Ah, there it is [16:53:10] Looks good. [16:53:12] All fixed [16:53:18] 10Operations, 10netops: Merge AS14907 with AS43281 - https://phabricator.wikimedia.org/T167840#3352086 (10ayounsi) How I understand it, the increased complexity of running two "networks" outweighs its advantages. And our customers are networks we manage and have control over. A single ASNs (using confederation... [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170615T1700). 
[17:00:14] No ORES [17:00:27] no parsoid [17:08:35] (03CR) 10Chad: releases: add new role/profile, add backups, install jenkins (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/359089 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn) [17:12:34] (03CR) 10Dzahn: releases: add new role/profile, add backups, install jenkins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/359089 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn) [17:16:42] (03CR) 10Chad: "https://puppet-compiler.wmflabs.org/6789/ - just manifest changes, no actual on-disk changes to files" [puppet] - 10https://gerrit.wikimedia.org/r/341729 (owner: 10Chad) [17:17:28] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3352141 (10RobH) [17:19:31] (03PS2) 10Dzahn: releases: add new role/profile, add backups, install jenkins [puppet] - 10https://gerrit.wikimedia.org/r/359089 (https://phabricator.wikimedia.org/T164030) [17:20:33] (03CR) 10Dzahn: releases: add new role/profile, add backups, install jenkins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/359089 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn) [17:22:57] jouncebot: next [17:22:57] In 0 hour(s) and 37 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170615T1800) [17:24:26] (03Abandoned) 10Chad: Install jenkins on releases.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/359029 (owner: 10Chad) [17:24:37] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3352194 (10Benoit_Rochon) Hello @Amire80 I got the translations of the namespace... [17:24:52] (03CR) 10BBlack: "Could we change the super-long lines to puppet's HEREDOC format? 
I think you can do it with indentation as well by using a style like:" [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [17:25:30] #til about puppet heredoc [17:25:35] (03CR) 10Dzahn: releases: add new role/profile, add backups, install jenkins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/359089 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn) [17:27:34] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3352201 (10RobH) a:05Cmjohnson>03chasemp [17:27:38] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3352203 (10jcrespo) [17:27:47] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3352141 (10RobH) Racking proposal: existing labstore systems use the vlan-labs-support1, so these have to be racked in rows A and C, as those rows have that... 
[17:27:54] !log disabling puppet on cp*wmnet to avoid puppet races on https://gerrit.wikimedia.org/r/#/c/341729 merge [17:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:07] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3156297 (10jcrespo) taking also db1099 and db1101 T167567 [17:28:29] !log install2002 - temp disabling puppet and applying hot fix to debug install issue for papaul [17:28:32] (03CR) 10BBlack: [C: 032] Move all ssl certs to the module and out of files/ [puppet] - 10https://gerrit.wikimedia.org/r/341729 (owner: 10Chad) [17:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:39] (03PS5) 10BBlack: Move all ssl certs to the module and out of files/ [puppet] - 10https://gerrit.wikimedia.org/r/341729 (owner: 10Chad) [17:28:41] (03CR) 10BBlack: [V: 032 C: 032] Move all ssl certs to the module and out of files/ [puppet] - 10https://gerrit.wikimedia.org/r/341729 (owner: 10Chad) [17:30:07] PROBLEM - Check systemd state on kubestage1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:31:47] PROBLEM - puppet last run on mw1188 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.eqiad.wmnet.crt] [17:32:27] PROBLEM - puppet last run on mw2132 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.codfw.wmnet.crt] [17:32:37] PROBLEM - puppet last run on mw2164 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [17:32:37] PROBLEM - puppet last run on mw2209 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.codfw.wmnet.crt] [17:33:17] heh [17:33:37] should've done a selection across more than just the caches I guess, to capture all the possible races [17:33:57] Wherps [17:34:08] I didn't think about those internal certs [17:34:13] checking mw2209 to be sure it's just the race [17:34:37] RECOVERY - puppet last run on mw2209 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [17:34:38] yes, just the race, so not a huge deal other than minor spam here [17:36:20] :) [17:36:28] that was nice, getting them out of files, thanks [17:38:51] 10Operations, 10netops: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841#3352281 (10ayounsi) Best practices for confederation is to limit IGP within each confederation. The main advantages are reducing the blast radius if OSPF miss-behaves, and increasing convergence speed.... [17:39:27] RECOVERY - puppet last run on mw2132 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [17:39:37] RECOVERY - puppet last run on mw2164 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [17:41:40] mutante bblack: I'll drop a note to the ops list bragging about killing it and reminding people not to bring it back [17:41:47] RECOVERY - puppet last run on mw1188 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:42:17] RainbowSprinkles: cool [17:45:53] (03PS1) 10Jcrespo: mariadb: Pool db1099 and db1101 as temporary substitutes of pc2/3 [puppet] - 10https://gerrit.wikimedia.org/r/359195 (https://phabricator.wikimedia.org/T167567) [17:46:28] (03PS14) 10Krinkle: varnish: Avoid std.fileread() and use new errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) [17:48:33] (03CR) 10jerkins-bot: [V: 04-1] varnish: Avoid std.fileread() and use new errorpage template [puppet] - 
10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [17:49:05] (03CR) 10Jcrespo: [C: 032] mariadb: Pool db1099 and db1101 as temporary substitutes of pc2/3 [puppet] - 10https://gerrit.wikimedia.org/r/359195 (https://phabricator.wikimedia.org/T167567) (owner: 10Jcrespo) [17:50:05] !log install2002 - re-enabled puppet, reverted live hack, back to normal (issue seems to be NIC or other) [17:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:20] the puppet errors are mine, and are harmless [17:51:43] I am in the process of setting up db1099 and db1101 [17:51:46] thanks [17:52:10] no, not those puppet errors ^ [17:52:18] the ones that will appear here in a second [17:52:47] PROBLEM - puppet last run on db1099 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[pt-heartbeat] [17:52:53] that one [17:52:53] ok, yep, aware that the cert ones are a different thing [17:52:56] gotcha [17:52:58] and another [17:53:35] (03PS15) 10Krinkle: varnish: Avoid std.fileread() and use new errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) [17:53:55] I need now to mass-hit pending to disable pages [17:54:19] PROBLEM - puppet last run on db1101 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[pt-heartbeat] [17:54:40] (03PS1) 10Dzahn: add IPv6 records for releases1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/359198 (https://phabricator.wikimedia.org/T164030) [17:54:49] I will fix them now [17:54:59] (03CR) 10jerkins-bot: [V: 04-1] varnish: Avoid std.fileread() and use new errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [17:55:43] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: rack/setup/install kafka100[4-9].eqiad.wmnet - https://phabricator.wikimedia.org/T167992#3352337 (10RobH) [17:56:16] (03PS16) 10Krinkle: varnish: Avoid std.fileread() and use new errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) [17:58:01] (03CR) 10jerkins-bot: [V: 04-1] varnish: Avoid std.fileread() and use new errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [17:58:23] (03PS1) 10Dzahn: rcs1001/1002: repeat hostname with AAAA record [dns] - 10https://gerrit.wikimedia.org/r/359199 [17:59:01] (03CR) 10Dzahn: [C: 032] add IPv6 records for releases1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/359198 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn) [17:59:15] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3352375 (10chasemp) a:05chasemp>03RobH Note from irc: these are closer in function to the old dataset boxes rather than existing labstores. They need t... [17:59:27] bblack: Not sure what I'm doing wrong here. First thought it didn't work in hashes, but now it fails as plain variable string as well. 
[17:59:41] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3352377 (10Amire80) Thanks. And sorry, I forgot a few: - Special - Media [17:59:47] (03CR) 10Krinkle: varnish: Avoid std.fileread() and use new errorpage template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170615T1800). [18:00:05] framawiki and ebernhardson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:14] I swat [18:00:15] Krinkle: yeah I've been looking too, I'm starting to suspect our puppet doesn't support heredoc at all :( [18:00:41] bblack: http://puppet-on-the-edge.blogspot.co.uk/2014/03/heredoc-is-here.html [18:00:54] suggests 3.5 introduced it, but behind a so-called future-parser option [18:00:57] So probably 4.x [18:00:58] oh, requires future parser :( [18:00:59] ebernhardson, framawiki: About? [18:01:24] I don't think we enabled future parser yet [18:01:30] sorry for sending you on a wild goose chase! 
[18:01:30] (03CR) 10Dzahn: [C: 032] "no change - just about formatting of the zone file, easier grep'ing when host names appear on each line with record" [dns] - 10https://gerrit.wikimedia.org/r/359199 (owner: 10Dzahn) [18:01:41] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3352386 (10RobH) a:05RobH>03Cmjohnson [18:01:43] i _think_ we enabled it in the puppet compiler [18:01:55] (03CR) 10Krinkle: "back to plain strings" [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [18:02:06] o/ [18:02:41] Ok, you're first then :) [18:02:42] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: rack/setup/install kafka100[4-9].eqiad.wmnet - https://phabricator.wikimedia.org/T167992#3352404 (10Ottomata) This will be a totally different cluster than the nodes in 1001-1003, or the 1012-1022ish nodes in the analytics cluster. Can we someho... 
[18:02:53] (03PS3) 10Chad: Create a FeaturedFeed for the RAW bulletin on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358772 (https://phabricator.wikimedia.org/T167617) (owner: 10Framawiki) [18:02:55] RainbowSprinkles: yup [18:02:59] (03CR) 10Chad: [C: 032] Create a FeaturedFeed for the RAW bulletin on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358772 (https://phabricator.wikimedia.org/T167617) (owner: 10Framawiki) [18:03:28] (03PS3) 10Dzahn: releases: add new role/profile, add backups, install jenkins [puppet] - 10https://gerrit.wikimedia.org/r/359089 (https://phabricator.wikimedia.org/T164030) [18:05:58] * RainbowSprinkles twiddles thumbs [18:06:17] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3352434 (10Benoit_Rochon) No problem : - Special = Kotakahi - Media = Tipatc... [18:06:25] o/ mutante I have zppix helping us look into why we didn't get an icinga ping about some 500s for our recent issue with ORES. Could you help him get familiar with our prod icinga configs? [18:06:35] * halfak isn't very familiar [18:06:37] or i'd do it :\ [18:06:58] mutante: o/ can you tell me how you have prod ores setup for checks? [18:07:48] (03PS18) 10Krinkle: varnish: Avoid std.fileread() and use new errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) [18:08:14] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3352438 (10Amire80) Also, a couple of things I noticed: - Definitely an error: "T... [18:08:25] halfak: a little bit, while i'm familiar with icinga, i am not familiar with this specific check that talks to graphite. 
it's more about the logs being in there i think [18:08:57] mutante, I don't think this one talks to graphite. I think this one just tries to hit the service. [18:09:07] 1/10 requests was getting a 500 for a while. [18:09:14] Maybe the check just needs to happen more often :/ [18:09:16] i thought we were talking about alerts on the rate of 5xx [18:09:17] (03CR) 10Krinkle: "Re-applied on beta and through puppet-compiler - https://puppet-compiler.wmflabs.org/6793/" [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [18:09:26] mutante: question are the checks you guys run public? [18:09:28] mutante, could be that's what we need. [18:09:33] framawiki, ebernhardson: Hang on a bit, zuul's a little backed up [18:10:07] Zppix: there are 2 parts to it, i can speak for the icinga part, yes, that is public in the operations/puppet repo [18:10:11] Zppix, it could be that changeprop can help us with this. [18:10:25] changeprop would be the way we'd track the rate of 500s since it hits ORES constantly. [18:10:28] mutante: but not per server no? [18:10:31] Like 5-10 times per second. :) [18:10:40] (03Merged) 10jenkins-bot: Create a FeaturedFeed for the RAW bulletin on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358772 (https://phabricator.wikimedia.org/T167617) (owner: 10Framawiki) [18:10:54] o/ Pchelolo! Got a minute to talk about ChangeProp and icinga? [18:10:56] (03CR) 10jenkins-bot: Create a FeaturedFeed for the RAW bulletin on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358772 (https://phabricator.wikimedia.org/T167617) (owner: 10Framawiki) [18:11:01] Zppix: icinga checks are usually in role classes, (or profiles now), then these are applied to groups of nodes in site.pp [18:11:17] mutante, oh! in the puppet repo! 
[18:11:21] * halfak looks for that [18:11:38] if you grep the puppet repo for "monitoring::service" for example [18:11:39] RECOVERY - puppet last run on db1099 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [18:11:43] you will get a bunch of icinga checks [18:11:45] mutante: Ok, is it possible to maybe increase the rate of checks for ORES? [18:11:57] Zppix: which specific one? [18:12:08] yea, in general it would be [18:12:08] halfak: which were getting 500s? [18:12:35] !log demon@tin Synchronized wmf-config/FeaturedFeedsWMF.php: T167617 (duration: 00m 44s) [18:12:38] Zppix, a lot of the requests *from* changeprop were getting 500s. Those should *never* get 500s. [18:12:43] halfak: I'm having a sick day so it won't be a very productive talk on my end, but ye, sure [18:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:45] T167617: Create a FeaturedFeed for the frwiki RAW monthly bulletin - https://phabricator.wikimedia.org/T167617 [18:12:58] Pchelolo, no way dude. Have your sick day. Thanks for letting us know :) [18:13:10] halfak: nono, it's all right [18:13:20] RainbowSprinkles: it works, thanks :) [18:13:35] Second file going now [18:13:42] mutante i'd assume we would increase them on changeprop, for a start hopefully that will do. halfak what you think? [18:13:44] Pchelolo, OK just checking about how we might get icinga pings from ChangeProp. ChangeProp hits ORES really fast and it should never get a 500. [18:14:06] Recently we had an issue where ChangeProp was getting 1/10 responses as a 500 [18:14:13] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: T167617 (duration: 00m 44s) [18:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:23] Zppix: what do you mean? increase the frequency a check runs? [18:14:26] We didn't notice it until an hour after it started because icinga somehow wasn't getting the 500. [18:14:43] RainbowSprinkles: yeah !
https://fr.wikipedia.org/w/api.php?action=featuredfeed&feed=raw&feedformat=atom [18:14:44] So I thought that maybe ChangeProp could say something when it gets a 500 [18:14:51] Zppix: how to find existing ORES icinga checks in puppet repo: ~/puppet$ grep -r ores * | grep monitor [18:15:07] framawiki: We all good? [18:15:08] :) [18:15:10] Zppix, mutante https://github.com/wikimedia/puppet/blob/e959321aa620b77403cc9379db2e86080323c6e8/modules/icinga/manifests/monitor/ores.pp [18:15:12] This ^ ? [18:15:15] yep, thanks :) [18:15:16] halfak: the easiest thing to do is to set up alerts from grafana on the retry topic rate [18:15:34] Pchelolo, Gotcha. Any examples or docs you could point me to? [18:15:49] halfak: yea, those are the existing Icinga checks for ORES, yep [18:16:03] it's checking if the home page is up [18:16:14] if the workers respond [18:16:15] ebernhardson: Ok you're next [18:16:18] (03PS1) 10Faidon Liambotis: Revert "Move all ssl certs to the module and out of files/" [puppet] - 10https://gerrit.wikimedia.org/r/359211 [18:16:24] halfak: that's something we wanted to do for RESTBase. There's some details on this ticket: https://phabricator.wikimedia.org/T162765 [18:16:50] !log demon@tin Synchronized php-1.30.0-wmf.5/includes/libs/objectcache/MultiWriteBagOStuff.php: T167465 (duration: 00m 44s) [18:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:59] mutante: but it doesnt check changeprop... if i understand aaron right thats what he wants [18:16:59] T167465: "Key contains invalid characters" when using MultiWriteBagOStuff - https://phabricator.wikimedia.org/T167465 [18:17:02] halfak: Zppix: what Pchelolo said, i think there is just no existing check for 5xx in graphite [18:17:12] Zppix: i guess so, yea [18:17:20] (03PS2) 10Faidon Liambotis: Revert "Move all ssl certs to the module and out of files/" [puppet] - 10https://gerrit.wikimedia.org/r/359211 [18:17:21] RainbowSprinkles: wonderful! 
[18:17:30] mutante: can it be setup? [18:17:48] Zppix: i'm sure it can, but i haven't done it [18:17:49] ebernhardson: Any preference to order? [18:18:10] mutante: who usually handles that, or i or aaron can open a task. [18:18:17] Zppix: there is "check_graphite" [18:18:28] Zppix: task is the way :) [18:18:37] https://phabricator.wikimedia.org/T167830 [18:18:38] Krinkle: Cache key error appears to be disappearing [18:18:39] mutante, ^ [18:18:49] That task is where I'm pasting notes now :) [18:18:55] sorry, i cant login to phab right now, but i will be soon again [18:18:58] 10Operations, 10Services (blocked): Set up grafana alerting for services - https://phabricator.wikimedia.org/T162765#3352513 (10Pchelolo) @Halfak pointed out that using this to check the retry topic rate for ORES would help them [18:19:01] sounds good [18:20:10] (03CR) 10Chad: [C: 032] Enable token_count_router for cirrus queries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359183 (https://phabricator.wikimedia.org/T152094) (owner: 10EBernhardson) [18:20:24] RainbowSprinkles: can ship them together. [18:20:33] Zppix: search for "check_graphite_" there is more than one, check_graphite_threshold, check_graphite_anomaly, check_graphite_freshness.. i think it's one of those [18:20:36] 10Operations, 10Services (blocked): Set up grafana alerting for services - https://phabricator.wikimedia.org/T162765#3352524 (10Halfak) Our task: {T167830} Essentially intermittent 500s were not getting caught by the icinga check quickly enough so we'd like to catch it via #ChangeProp since it hits ORES real... [18:20:41] (03CR) 10Faidon Liambotis: [C: 032] Revert "Move all ssl certs to the module and out of files/" [puppet] - 10https://gerrit.wikimedia.org/r/359211 (owner: 10Faidon Liambotis) [18:20:43] thanks Pchelolo. We'll look into it and share notes if we make progress. [18:21:06] great halfak sorry I can't be more helpful today :( [18:21:09] Zppix, I'm going to head out and have a (very late) lunch. 
I'll leave the discussion to you. Make sure to add notes to the task OK? [18:21:13] mutante: will do let me continue looking into this and i may end up asking for additional services for icinga being added [18:21:19] halfak: ack [18:21:28] No worries Pchelolo. I hope you feel better tomorrow :) [18:21:30] RECOVERY - puppet last run on db1101 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [18:21:43] (03PS1) 10Ottomata: Insert select eventbus generated event topics into eventlogging MySQL database [puppet] - 10https://gerrit.wikimedia.org/r/359212 (https://phabricator.wikimedia.org/T150369) [18:21:54] RainbowSprinkles: nice [18:22:04] Errr. [18:22:07] Maybe I spoke too soon [18:22:13] ill get you updated Pchelolo , mutante [18:22:16] keep* [18:22:17] Definitely gone down [18:22:32] Oh no, gone [18:22:35] I can't read a graph [18:22:57] (03CR) 10jerkins-bot: [V: 04-1] Insert select eventbus generated event topics into eventlogging MySQL database [puppet] - 10https://gerrit.wikimedia.org/r/359212 (https://phabricator.wikimedia.org/T150369) (owner: 10Ottomata) [18:23:07] (03Merged) 10jenkins-bot: Enable token_count_router for cirrus queries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359183 (https://phabricator.wikimedia.org/T152094) (owner: 10EBernhardson) [18:23:18] (03PS2) 10Chad: [cirrus] remove elastic quirks after elastic 5.3 upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353100 (owner: 10DCausse) [18:23:22] (03CR) 10Chad: [C: 032] [cirrus] remove elastic quirks after elastic 5.3 upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353100 (owner: 10DCausse) [18:24:09] PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.eqiad.wmnet.crt] [18:24:16] 10Operations, 10Services (blocked): Set up grafana alerting for services - https://phabricator.wikimedia.org/T162765#3352531 (10GWicke) @pchelolo, @halfak: If you want to go the grafana route for ease of modification, I would recommend to set up a separate dashboard for ORES retries. You can then make that das... [18:24:19] PROBLEM - puppet last run on mw2179 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:24:19] PROBLEM - puppet last run on mw2106 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:24:29] PROBLEM - puppet last run on cp1071 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-rsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/digicert-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-rsa-unified.crt] [18:24:29] PROBLEM - puppet last run on cp2026 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-rsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/digicert-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-rsa-unified.crt] [18:24:30] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. 
Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-rsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/digicert-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-rsa-unified.crt] [18:24:30] PROBLEM - puppet last run on mw1295 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/rendering.svc.eqiad.wmnet.crt] [18:24:30] PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/ldap-labs.codfw.wikimedia.org.crt] [18:24:37] Yay race as it undoes itself :) [18:24:38] hehe [18:24:39] PROBLEM - puppet last run on mw2139 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.codfw.wmnet.crt] [18:24:40] Are we having database switchover right now? On Commons, I'm getting: This wiki is in read-only mode for a datacenter switchover test. See https://meta.wikimedia.org/wiki/codfw for more information. [18:24:49] PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.eqiad.wmnet.crt] [18:24:49] PROBLEM - puppet last run on mw2192 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:24:49] PROBLEM - puppet last run on mw2102 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:24:49] PROBLEM - puppet last run on mw1298 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/rendering.svc.eqiad.wmnet.crt] [18:24:52] PROBLEM - puppet last run on mw1220 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.eqiad.wmnet.crt] [18:25:09] and the linked page does not list today as a switchover date [18:25:09] PROBLEM - puppet last run on mw2235 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:25:09] PROBLEM - puppet last run on mw2239 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:25:09] PROBLEM - puppet last run on cp2017 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 3 minutes ago with 4 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-rsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/digicert-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-rsa-unified.crt] [18:25:09] PROBLEM - puppet last run on mw1284 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.eqiad.wmnet.crt] [18:25:14] (03Merged) 10jenkins-bot: [cirrus] remove elastic quirks after elastic 5.3 upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353100 (owner: 10DCausse) [18:25:19] PROBLEM - puppet last run on mw2111 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:25:22] zhuyifei1999_: Um, no? Possibly transient? 
[18:25:29] PROBLEM - puppet last run on mw2121 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.codfw.wmnet.crt] [18:25:33] I'm not seeing that [18:25:39] PROBLEM - puppet last run on mw1193 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.eqiad.wmnet.crt] [18:25:40] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/globalsign-2016-rsa-unified.crt] [18:25:40] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-rsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/digicert-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-rsa-unified.crt] [18:25:40] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/etcd.eqiad.wmnet.crt] [18:25:49] hmm, /me looks [18:25:49] PROBLEM - puppet last run on mw1293 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/rendering.svc.eqiad.wmnet.crt] [18:25:49] PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.codfw.wmnet.crt] [18:25:49] PROBLEM - puppet last run on mw2188 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:25:49] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.eqiad.wmnet.crt] [18:26:09] PROBLEM - puppet last run on mw1273 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.eqiad.wmnet.crt] [18:26:11] oh oops, X-wikimedia-debug [18:26:19] PROBLEM - puppet last run on labtestvirt2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/labvirt-star.codfw.wmnet.crt] [18:26:19] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/labvirt-star.eqiad.wmnet.crt] [18:26:29] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-rsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/digicert-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-rsa-unified.crt] [18:26:30] PROBLEM - puppet last run on cp2022 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-rsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/digicert-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-rsa-unified.crt] [18:26:39] PROBLEM - puppet last run on mw2259 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:26:39] PROBLEM - puppet last run on mw2170 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:26:39] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.eqiad.wmnet.crt] [18:26:49] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/mail.wikimedia.org.crt] [18:26:49] PROBLEM - puppet last run on mw2255 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:26:49] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.eqiad.wmnet.crt] [18:26:49] PROBLEM - puppet last run on mw2212 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.codfw.wmnet.crt] [18:26:59] PROBLEM - puppet last run on cp1008 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-rsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/digicert-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-rsa-unified.crt] [18:27:09] PROBLEM - puppet last run on mw1195 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.eqiad.wmnet.crt] [18:27:09] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-ecdsa-unified.crt] [18:27:19] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-rsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-ecdsa-unified.crt] [18:27:19] PROBLEM - puppet last run on mw2017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:27:19] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.eqiad.wmnet.crt] [18:27:20] PROBLEM - puppet last run on mw2099 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:27:29] PROBLEM - puppet last run on mw2252 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.codfw.wmnet.crt] [18:27:30] !log demon@tin Synchronized wmf-config/CirrusSearch-common.php: Remove quirks and enable token_count_router thingie (duration: 00m 44s) [18:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:39] PROBLEM - puppet last run on mw2140 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.codfw.wmnet.crt] [18:27:39] PROBLEM - puppet last run on cp1063 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/globalsign-2016-rsa-unified.crt] [18:27:43] ebernhardson: Done ^ [18:27:49] PROBLEM - puppet last run on mw2175 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:27:49] PROBLEM - puppet last run on mw2206 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.codfw.wmnet.crt] [18:27:49] PROBLEM - puppet last run on mw2116 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:28:08] RainbowSprinkles: checking [18:28:09] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-rsa-unified.crt] [18:28:09] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/globalsign-2016-ecdsa-unified.crt] [18:28:09] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-rsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/digicert-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-rsa-unified.crt] [18:28:19] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/globalsign-2016-ecdsa-unified.crt] [18:28:29] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-rsa-unified.crt] [18:28:59] PROBLEM - puppet last run on mw1176 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.eqiad.wmnet.crt] [18:29:09] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-rsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/digicert-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-rsa-unified.crt] [18:29:10] PROBLEM - puppet last run on cp2019 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-ecdsa-unified.crt] [18:29:19] PROBLEM - puppet last run on mw2181 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:29:29] PROBLEM - puppet last run on cp2006 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-rsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/digicert-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-rsa-unified.crt] [18:29:39] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/ssl/localcerts/labvirt-star.eqiad.wmnet.crt] [18:29:39] PROBLEM - puppet last run on conf2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/etcd.codfw.wmnet.crt] [18:29:49] PROBLEM - puppet last run on mw2196 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:29:49] PROBLEM - puppet last run on mw2134 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.codfw.wmnet.crt] [18:30:09] PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/ldap-labs.eqiad.wikimedia.org.crt] [18:30:09] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-ecdsa-unified.crt] [18:30:09] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/labvirt-star.eqiad.wmnet.crt] [18:30:09] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-rsa-unified.crt] [18:30:49] PROBLEM - puppet last run on mw2254 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:30:50] PROBLEM - puppet last run on mw2166 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:30:59] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-rsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/digicert-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-rsa-unified.crt] [18:31:19] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-rsa-unified.crt] [18:31:19] PROBLEM - puppet last run on mw1261 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.eqiad.wmnet.crt] [18:31:39] PROBLEM - puppet last run on mw2132 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.codfw.wmnet.crt] [18:31:49] PROBLEM - puppet last run on cp2012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-ecdsa-unified.crt] [18:31:49] PROBLEM - puppet last run on mw2209 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.codfw.wmnet.crt] [18:31:49] PROBLEM - puppet last run on mw2100 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:32:09] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/globalsign-2016-ecdsa-unified.crt] [18:32:14] RainbowSprinkles: looks to at least not be causing problems. Digging through tons of elasticsearch explain output (damn is it verbose...) to figure out if it's doing as intended [18:32:19] PROBLEM - puppet last run on mw2133 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.codfw.wmnet.crt] [18:32:50] PROBLEM - puppet last run on mw2240 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:33:09] PROBLEM - puppet last run on mw1178 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.eqiad.wmnet.crt] [18:33:29] PROBLEM - puppet last run on cp2008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/globalsign-2016-rsa-unified.crt] [18:33:29] PROBLEM - puppet last run on cp2011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/globalsign-2016-ecdsa-unified.crt] [18:33:39] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/globalsign-2016-ecdsa-unified.crt] [18:33:39] PROBLEM - puppet last run on cp1049 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. 
Failed resources (up to 3 shown): File[/etc/ssl/localcerts/globalsign-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-rsa-unified.crt] [18:34:09] PROBLEM - puppet last run on mw1256 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.eqiad.wmnet.crt] [18:34:19] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/globalsign-2016-ecdsa-unified.crt] [18:34:29] PROBLEM - puppet last run on mw2122 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.codfw.wmnet.crt] [18:34:39] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.codfw.wmnet.crt] [18:34:41] 10Operations, 10Traffic: uploads.wm.o commons archive 20170615014039!Adsalm.webm visible despite file deleted on Commons - https://phabricator.wikimedia.org/T168002#3352611 (10zhuyifei1999) [18:34:49] PROBLEM - puppet last run on dubnium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/ldap-corp.eqiad.wikimedia.org.crt] [18:34:49] PROBLEM - puppet last run on mw2195 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:34:55] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: rack/setup/install kafka100[4-9].eqiad.wmnet - https://phabricator.wikimedia.org/T167992#3352624 (10RobH) @Ottomata that is not how we denote different clusters for any other hostnames on the cluster, so it seems bad to have kafka/analytics diffe... [18:35:09] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.eqiad.wmnet.crt] [18:35:19] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-ecdsa-unified.crt] [18:35:35] 10Operations, 10Analytics, 10WMF-Legal, 10Privacy: Honor DNT header for access logs & varnish logs - https://phabricator.wikimedia.org/T98831#3352626 (10Nuria) FYI, Our privacy policy does mention we do not honor DNT. [18:35:39] PROBLEM - puppet last run on mw2227 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:35:50] (03CR) 10jenkins-bot: Enable token_count_router for cirrus queries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359183 (https://phabricator.wikimedia.org/T152094) (owner: 10EBernhardson) [18:35:52] (03CR) 10jenkins-bot: [cirrus] remove elastic quirks after elastic 5.3 upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353100 (owner: 10DCausse) [18:35:59] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. 
Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-ecdsa-unified.crt],File[/etc/ssl/localcerts/globalsign-2016-rsa-unified.crt] [18:36:09] PROBLEM - puppet last run on cp2005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/globalsign-2016-ecdsa-unified.crt] [18:36:49] PROBLEM - puppet last run on mw2199 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:37:19] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/digicert-2016-rsa-unified.crt] [18:39:39] PROBLEM - puppet last run on mw2167 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/appservers.svc.codfw.wmnet.crt] [18:40:40] !log temporarily stopping icinga-wm [18:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:38] 10Operations, 10Analytics, 10WMF-Legal, 10Privacy: Honor DNT header for access logs & varnish logs - https://phabricator.wikimedia.org/T98831#3352638 (10Nuria) https://wikimediafoundation.org/wiki/Privacy_policy/FAQ#DNTFAQ , if we were to do it i just found recently about the w3 api on this regard: https:/... [18:43:01] gwicke: got a moment? 
[18:44:45] !log restarting all puppetmasters [18:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:13] (03CR) 10Bmansurov: [C: 031] Remove dead config variable MinervaPrintStyles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359043 (https://phabricator.wikimedia.org/T166408) (owner: 10Jdlrobson) [18:49:19] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3352669 (10Benoit_Rochon) Right, my bad. Atikamekw is very old and is particular... [18:50:34] (03CR) 10Bmansurov: [C: 031] Remove Cards from the cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359045 (https://phabricator.wikimedia.org/T167452) (owner: 10Jdlrobson) [18:50:36] (03PS4) 10Hashar: swift: lower replication interval for beta [puppet] - 10https://gerrit.wikimedia.org/r/344387 (https://phabricator.wikimedia.org/T160990) [18:51:31] (03PS1) 10Faidon Liambotis: ssl: cleanup a bunch of expired/obsolete certs [puppet] - 10https://gerrit.wikimedia.org/r/359221 [18:51:34] bblack: ^ [18:51:44] bblack: the last two are a bit more controversial I suppose [18:51:54] I'm at least ambivalent about them [18:52:02] they're still active/unexpired, but we don't use them [18:52:16] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: rack/setup/install kafka100[4-9].eqiad.wmnet - https://phabricator.wikimedia.org/T167992#3352673 (10RobH) IRC Update: Otto is going to chat with the rest of the folks involved in analytics, but we're leaning towards the following: kafka1XXX => d... [18:53:14] Can I do a very fast deploy of ores, simple config change, definitely won't break anything and fixes 404s? https://ores.wikimedia.org/foo [18:53:52] that's a question for RainbowSprinkles/greg-g I suppose? 
[18:54:16] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: rack/setup/install kafka100[4-9].eqiad.wmnet - https://phabricator.wikimedia.org/T167992#3352682 (10RobH) If we cannot settle on hostnames before Chris goes to rack, we can set these up with asset tag mgmt dns entries only, and not put the hostna... [18:54:33] Amir1: Go ahead and do it now, I'd rather you do it before the train and not after [18:54:37] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: rack/setup/install kafka100[4-9].eqiad.wmnet - https://phabricator.wikimedia.org/T167992#3352683 (10Ottomata) > Need input from @Ottomata on which vlans these 6 new hosts will use, as it will help determine row. Not in analytics vlan. These shou... [18:55:05] (03PS5) 10Hashar: swift: lower replication interval for beta [puppet] - 10https://gerrit.wikimedia.org/r/344387 (https://phabricator.wikimedia.org/T160990) [18:55:11] Thanks! [18:58:14] !log ladsgroup@tin Started deploy [ores/deploy@ab88a74]: Deploying gerrit:359224/1 for missing config variables [18:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:42] 10Operations, 10Traffic: uploads.wm.o commons archive 20170615014039!Adsalm.webm visible despite file deleted on Commons - https://phabricator.wikimedia.org/T168002#3352693 (10zhuyifei1999) Also T129845#3351290: https://upload.wikimedia.org/wikipedia/commons/archive/3/3e/20170615122111%2120170615014039%21Youne... [19:00:04] RainbowSprinkles: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170615T1900). [19:02:34] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: rack/setup/install kafka100[4-9].eqiad.wmnet - https://phabricator.wikimedia.org/T167992#3352700 (10Ottomata) I'll have to check with @elukey to finalize a cluster name. That'll have to wait til next week, sorry. FYI, our brainbounce of names i... 
[19:04:41] It's going on, super super slow [19:04:50] fetch stage(s): 22% (ok: 2; fail: 0; left: 7) [19:05:05] I think it doesn't use parallel connections [19:05:23] (03CR) 10Hashar: "PS5 bumps the replication delay further" [puppet] - 10https://gerrit.wikimedia.org/r/344387 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [19:06:04] (03PS1) 10Chad: group2 to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359225 [19:07:33] Amir1: Fetch should be in parallel [19:07:50] RainbowSprinkles: the config of "batch_size" is one [19:07:52] (03CR) 10Dzahn: [C: 032] "nitpick done" [puppet] - 10https://gerrit.wikimedia.org/r/359089 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn) [19:07:53] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-20_(1.30.0-wmf.6)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3352715 (10Amire80) >>! In T167714#3352669, @Benoit_Rochon wrote: > Right, my bad... [19:07:58] (03PS4) 10Dzahn: releases: add new role/profile, add backups, install jenkins [puppet] - 10https://gerrit.wikimedia.org/r/359089 (https://phabricator.wikimedia.org/T164030) [19:08:04] I don't know if that changes it [19:09:05] batch_size of 1? Yeah that's sequential [19:09:07] Default is 80? [19:09:12] According to docs [19:09:24] Better question for thcipriani [19:09:34] yeah, but ores has ten nodes only [19:09:47] (03CR) 10Dzahn: [C: 032] "used on new VM (originally mwreleases), current releases on bromine is unaffacted as of now but the plan is to move it here as well" [puppet] - 10https://gerrit.wikimedia.org/r/359089 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn) [19:09:48] I guess how many would you want to do at the same time? 
[19:09:49] :) [19:09:53] so batch size should be around 5 ish I gues [19:10:02] *guess [19:10:26] well, it's extremely slow, last time took around half an hour to finish the deployment [19:10:38] hrm, yeah, batch_size is probably not what you want to mess with if you want to deploy in stages. It's more like concurrent ssh connections than a true batch size [19:11:15] if you want to divide deploy into 2 groups you could do something like group_size: 5 [19:11:30] it would deploy fully to 5 servers, then fully to another 5 servers [19:11:55] with a batch_size of 1, yes, each process (fetch, promote, service restart) will happen sequentially [19:12:09] but it will happen on all servers at the same time [19:12:26] okay I understand it now [19:12:28] if that makes sense :) [19:13:14] it does [19:14:31] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: rack/setup/install new kafka nodes - https://phabricator.wikimedia.org/T167992#3352759 (10RobH) [19:15:27] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: rack/setup/install new kafka nodes - https://phabricator.wikimedia.org/T167992#3352337 (10RobH) [19:15:55] 10Operations, 10Discovery, 10Maps, 10Traffic, 10Interactive-Sprint: Rate-limit browsers without referers - https://phabricator.wikimedia.org/T154704#2921080 (10debt) @Gehel will chat with the #traffic team about this. 
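Note: the batch_size versus group_size distinction explained above can be modelled in a few lines of Python. This is only an illustrative sketch of one reading of that explanation, not scap's actual implementation. The function names (plan_deploy, chunk), the host names, and the exact stage order are assumptions made for the example. Only the two setting names, the fetch/promote/restart stages, and the ten-node figure come from the conversation.

```python
from typing import List, Tuple

# Stage names follow the chat ("fetch, promote, service restart"); the order is assumed.
STAGES = ["fetch", "promote", "restart_service"]


def chunk(items: List[str], size: int) -> List[List[str]]:
    """Split items into consecutive chunks of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be >= 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def plan_deploy(hosts: List[str], group_size: int, batch_size: int) -> List[Tuple[int, str, List[str]]]:
    """Return the ordered steps of a hypothetical staged deploy.

    Each group of `group_size` hosts goes through every stage before the
    next group starts. Within a stage, at most `batch_size` hosts are
    contacted at once (modelled here as consecutive batches).
    """
    steps = []
    for group_index, group in enumerate(chunk(hosts, group_size)):
        for stage in STAGES:
            for batch in chunk(group, batch_size):
                steps.append((group_index, stage, batch))
    return steps


if __name__ == "__main__":
    # Ten hypothetical hosts, split into two groups of five as suggested in the chat.
    hosts = [f"host{n:02d}" for n in range(1, 11)]
    for step in plan_deploy(hosts, group_size=5, batch_size=5):
        print(step)
```

In this model a batch_size of 1 makes every stage walk the hosts one by one, while group_size decides how many hosts are fully deployed before the rest are touched.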
[19:17:24] !log Re-enabled link between cr2-codfw and cr1-eqdfw - T167261 [19:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:33] T167261: Faulty link between cr2-codfw and cr1-eqdfw - https://phabricator.wikimedia.org/T167261 [19:17:36] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: rack/setup/install new kafka nodes - https://phabricator.wikimedia.org/T167992#3352801 (10RobH) [19:17:39] https://ores.wikimedia.org/example [19:18:04] it's 55% on restart_service_stages [19:18:11] (03PS1) 10Dzahn: jenkins: add stretch support, no more jdk7, use jdk8 [puppet] - 10https://gerrit.wikimedia.org/r/359227 [19:22:29] !log ladsgroup@tin Finished deploy [ores/deploy@ab88a74]: Deploying gerrit:359224/1 for missing config variables (duration: 24m 15s) [19:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:38] done [19:23:58] 10Operations, 10Traffic, 10Interactive-Sprint, 10Maps (Kartographer), 10Regression: Map tiles load way slower than before - https://phabricator.wikimedia.org/T167046#3352835 (10debt) 05Open>03Invalid We haven't been able to reproduce this, closing. [19:27:03] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-20_(1.30.0-wmf.6)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3352838 (10Benoit_Rochon) All good then. So we revied a few namespaces. Here is t... 
[19:29:01] (03PS1) 10BBlack: ratelimits: double the default anon limit [puppet] - 10https://gerrit.wikimedia.org/r/359231 (https://phabricator.wikimedia.org/T163233) [19:29:51] Amir1: Thx [19:30:35] (03CR) 10BBlack: [C: 032] ratelimits: double the default anon limit [puppet] - 10https://gerrit.wikimedia.org/r/359231 (https://phabricator.wikimedia.org/T163233) (owner: 10BBlack) [19:30:43] (03CR) 10Chad: [C: 032] group2 to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359225 (owner: 10Chad) [19:31:50] (03Merged) 10jenkins-bot: group2 to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359225 (owner: 10Chad) [19:32:03] (03CR) 10jenkins-bot: group2 to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359225 (owner: 10Chad) [19:35:02] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.5 [19:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:24] (03CR) 10Paladox: [C: 031] "Looks like a no op. Untested as i doint use the jenkins prod class but use content which does what it is doing here and it worked." [puppet] - 10https://gerrit.wikimedia.org/r/359227 (owner: 10Dzahn) [19:43:43] (03PS2) 10Dzahn: jenkins: add stretch support, no more jdk7, use jdk8 [puppet] - 10https://gerrit.wikimedia.org/r/359227 [19:44:02] (03CR) 10Chad: "I could've sworn we did this already..." 
[puppet] - 10https://gerrit.wikimedia.org/r/359227 (owner: 10Dzahn) [19:44:22] (03CR) 10Paladox: [C: 031] jenkins: add stretch support, no more jdk7, use jdk8 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/359227 (owner: 10Dzahn) [19:44:29] (03CR) 10Chad: "Ignore me, I was confused" [puppet] - 10https://gerrit.wikimedia.org/r/359227 (owner: 10Dzahn) [19:44:58] (03CR) 10Dzahn: "yep @ paladox, i use the exact same code we have in modules/jenkins/manifests/slave/requisites.pp already" [puppet] - 10https://gerrit.wikimedia.org/r/359227 (owner: 10Dzahn) [19:45:28] (03CR) 10Dzahn: "we did in modules/jenkins/manifests/slave/requisites.pp , 2 places in jenkins module" [puppet] - 10https://gerrit.wikimedia.org/r/359227 (owner: 10Dzahn) [19:45:47] (03CR) 10Dzahn: "slave vs master" [puppet] - 10https://gerrit.wikimedia.org/r/359227 (owner: 10Dzahn) [19:46:10] (03CR) 10Chad: jenkins: add stretch support, no more jdk7, use jdk8 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/359227 (owner: 10Dzahn) [19:47:14] (03CR) 10Paladox: [C: 031] jenkins: add stretch support, no more jdk7, use jdk8 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/359227 (owner: 10Dzahn) [19:47:38] (03CR) 10Dzahn: jenkins: add stretch support, no more jdk7, use jdk8 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/359227 (owner: 10Dzahn) [19:48:54] (03CR) 10Chad: jenkins: add stretch support, no more jdk7, use jdk8 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/359227 (owner: 10Dzahn) [19:50:14] (03CR) 10BBlack: [C: 031] ssl: cleanup a bunch of expired/obsolete certs [puppet] - 10https://gerrit.wikimedia.org/r/359221 (owner: 10Faidon Liambotis) [19:52:24] (03CR) 10Chad: [C: 031] ssl: cleanup a bunch of expired/obsolete certs [puppet] - 10https://gerrit.wikimedia.org/r/359221 (owner: 10Faidon Liambotis) [19:59:18] 10Operations, 10netops: Find a new PIM RP IP - https://phabricator.wikimedia.org/T167842#3352909 (10ayounsi) I don't have a great visibility on 
how our ip space is divided, but looking at DNS it looks like for example 208.80.154.200 could be a good choice. Close to the eqiad loopback IPs. Once we have an IP, I... [20:05:47] 10Operations, 10netops: Faulty link between cr2-codfw and cr1-eqdfw - https://phabricator.wikimedia.org/T167261#3352918 (10ayounsi) 05Open>03Resolved Circuit has been back to the same levels of traffic as before the issue for 1h, no more issues reported in Icinga or Smokeping. [20:13:31] godog: could you have a look at T168002 ? the longer the file stays, the worse the impact, as the pirates will be more willing to abuse commons [20:13:31] T168002: uploads.wm.o commons archive 20170615014039!Adsalm.webm visible despite file deleted on Commons - https://phabricator.wikimedia.org/T168002 [20:14:46] (03CR) 10Ottomata: [C: 04-1] "WIP" [puppet] - 10https://gerrit.wikimedia.org/r/359212 (https://phabricator.wikimedia.org/T150369) (owner: 10Ottomata) [20:16:29] 10Operations, 10netops: Find a new PIM RP IP - https://phabricator.wikimedia.org/T167842#3346542 (10BBlack) Multicast has its uses in general. Even if we kill HTCP another use may pop up. I did quick survey to try to find active uses. Filtering for just v4, removing the standard references you'd see to 224.... [20:18:34] mutante: do you know what grafana's checkrate is (time between checks) [20:19:32] Zppix: depends from where the data comes from (different backends) [20:20:13] Zppix: on the icinga side it depends on the "check_interval" setting [20:20:49] mutante: so that can be custom? [20:21:01] Zppix: yes, and if it's not set i think the default is 5m [20:21:04] ah, you mean alarms on icinga based on grafana's data... [20:21:37] maybe both [20:21:38] volans: yes sorry I'm trying to keep this all together in my head... 
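Note: the check-rate question above involves two separate intervals, the Icinga check_interval and how often the backend data is refreshed. A rough sketch of how they combine, assuming a check that polls a metrics backend on a fixed schedule. The 300-second figure is only the default recalled in the chat ("i think the default is 5m"), the 60-second refresh is a made-up value, and retries or soft states are ignored.

```python
def worst_case_detection_delay(check_interval_s: float, data_refresh_s: float) -> float:
    """Upper bound on the time between an event and the first check that can see it.

    The event may happen just after the backend refreshed (wait up to
    data_refresh_s for it to show up in the data) and the fresh data may
    arrive just after a check ran (wait up to check_interval_s more).
    """
    if check_interval_s <= 0 or data_refresh_s <= 0:
        raise ValueError("intervals must be positive")
    return check_interval_s + data_refresh_s


if __name__ == "__main__":
    # Hypothetical numbers: 5-minute check interval, 1-minute data refresh.
    delay = worst_case_detection_delay(check_interval_s=300, data_refresh_s=60)
    print(f"worst-case delay: {delay:.0f} s")
```

Under these assumptions, shortening only one of the two intervals leaves the other as the floor for how quickly an alert can fire.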
[20:21:59] it could be both, how often does icinga ask grafana and how often does data in grafana get updated [20:23:15] I mean if you know both [20:23:25] i just know the icinga part [20:23:27] any info i can get is helpful [20:24:11] i don't think it's about changing the interval of an existing check, there is just no existing check.. or? [20:24:20] i mean, there is one [20:24:34] but that doesn't ask grafana or look at 5xx rates [20:24:47] it connects directly to some URLs [20:25:46] mutante: Right now I'm gathering info on what the best data source for gathering 5xx errors would be, whether direct icinga or grafana. I think we might go with grafana, but I still have to talk to people to see what they think first. [20:26:58] ok [20:27:13] are we talking about the usually-laggy 5xx alerts we see from icinga here? [20:27:22] bblack: no, ores 5xx alerts [20:27:52] see T167830 [20:27:52] T167830: Extend icinga check to catch 500 errors like those of the 20170613 incident - https://phabricator.wikimedia.org/T167830 [20:29:34] cause on 06-13 19:00 utc and today at 19:00 utc is a huge difference [20:31:01] Zppix: the only thing that would've already alerted on this, probably, is the general misc-cluster 5xx alerting [20:31:10] which looks at graphite data which also drives this graph: [20:31:12] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=2&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5&from=now-7d&to=now [20:31:39] you can see the broad spike of 500s there on the 13th [20:31:56] but that covers all of the misc cluster as a whole, which has many services aside from ores [20:32:06] bblack: see https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&from=1497366649350&to=1497383640251 [20:43:23] 10Operations, 10MobileFrontend, 10Traffic: Remove disableImages handling from VCL - https://phabricator.wikimedia.org/T168013#3352978 (10MaxSem) [20:55:16] (03CR) 10Chad: [C: 032] Remove Cards from the cluster
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/359045 (https://phabricator.wikimedia.org/T167452) (owner: 10Jdlrobson) [20:57:29] !log upgrading RT (request tracker) [20:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:01] (03PS10) 10BryanDavis: bd808's dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/353937 [20:58:03] (03PS1) 10BryanDavis: Add vim-scripts as a standard package [puppet] - 10https://gerrit.wikimedia.org/r/359304 [20:58:05] (03PS2) 10Chad: Remove Cards from the cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359045 (https://phabricator.wikimedia.org/T167452) (owner: 10Jdlrobson) [21:02:58] does anyone know if changeprop has a graphite/grafana metric? [21:05:38] (03PS1) 10BryanDavis: WMCS: add wmcs-admin to labsdb100[13] [puppet] - 10https://gerrit.wikimedia.org/r/359323 (https://phabricator.wikimedia.org/T166310) [21:05:48] Hey folks. What would it take to get zppix access to prod graphite so he can work on some stuff with us? [21:05:55] If that a whole NDA dance or just a quick perm? [21:06:43] halfak: the whole NDA dance [21:06:50] Damn damn damn [21:06:51] k [21:06:57] If its behind ldap auth [21:07:13] He has an NDA? [21:07:20] Zppix, does? [21:07:20] For no purpose, if I remember the conversation the other day :p [21:07:24] lol [21:07:27] We have purpose! 
[21:07:27] Or that he wanted it for no purpose [21:07:30] whateverrrrrr [21:07:38] halfak: Don't get your hopes up :p [21:07:42] He'll help you until he gets bored [21:07:50] Then off to the next new toy [21:08:00] [21:08:02] RainbowSprinkles: really, thats nice to know how you think of me [21:08:05] That's how it works :) [21:08:07] !log demon@tin Started scap: Removing Cards extension [21:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:37] I have a lot of volunteers work with my team for short periods of time :) [21:08:41] Zppix: I mean, you hop around from one project to another :) [21:08:46] In fact I'm the only non-volunteer on my team [21:08:50] \o/ [21:08:52] Long term support doesn't seem to be your forte, that's all :) [21:08:56] Or wait. I meant :( [21:09:05] lol halfak [21:17:59] 10Operations, 10Ops-Access-Requests, 10User-Zppix: Graphite access for Zppix - https://phabricator.wikimedia.org/T168014#3353061 (10Zppix) [21:19:27] 10Operations, 10Ops-Access-Requests, 10User-Zppix: Graphite access for Zppix - https://phabricator.wikimedia.org/T168014#3353091 (10Halfak) I'd like to have Zppix access graphite so he can help us with some icinga monitoring setup in #ores [21:19:39] 10Operations, 10Ops-Access-Requests, 10Scoring-platform-team, 10User-Zppix: Graphite access for Zppix - https://phabricator.wikimedia.org/T168014#3353092 (10Halfak) [21:20:46] (03CR) 10jenkins-bot: Remove Cards from the cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359045 (https://phabricator.wikimedia.org/T167452) (owner: 10Jdlrobson) [21:22:57] Cards extension? [21:23:51] Esther: i think they were what popups replaced atleast thats what i assume i dont know there was a cards ext [21:24:30] Oh, Hovercards? [21:24:37] I was thinking TwitterCards. 
[21:25:24] idk [21:26:15] Esther: https://www.mediawiki.org/wiki/Extension:Cards [21:26:41] 10Operations, 10Ops-Access-Requests, 10Scoring-platform-team, 10User-Zppix: Graphite access for Zppix - https://phabricator.wikimedia.org/T168014#3353061 (10jcrespo) BTW, graphite access is already public: https://graphite.wikimedia.org/render/?target=ores.scb1001.precache_cache_hit.count https://graphite... [21:28:00] 10Operations, 10Ops-Access-Requests, 10Scoring-platform-team, 10User-Zppix: Graphite access for Zppix - https://phabricator.wikimedia.org/T168014#3353137 (10Zppix) @jcrespo when going to graphite.wikimedia.org i get a prompt to login i try every wmf login i have (non wikitech and wikitech and shell ) and n... [21:29:57] !log demon@tin Finished scap: Removing Cards extension (duration: 21m 49s) [21:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:11] !log re-enabled puppet and force run to re-enable ircecho on einstenium [21:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:03] !log volans@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw2251.codfw.wmnet [21:39:11] RainbowSprinkles: ^^^ [21:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:23] volans: tyvm [21:40:45] mutante: if you're not too busy can I handover to you mw2251? seems down since 15m on icinga [21:42:50] volans: seems you already depooled it? [21:42:57] yes just did [21:43:05] PROBLEM - jenkins_service_running on releases1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [21:43:26] it needs to see the console to understand why it went down, possibly restart it and re-sync before re-pooling [21:43:29] to ensure it has the latest code [21:43:45] PROBLEM - PyBal backends health check on lvs3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:43:58] volans: ok [21:44:08] thanks a lot! 
[21:45:35] RECOVERY - PyBal backends health check on lvs3001 is OK: PYBAL OK - All pools are healthy [21:45:57] !log powercycling mw2251 (frozen console) [21:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:05] RECOVERY - Host mw2251 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [21:48:35] PROBLEM - High lag on wdqs2001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1800.0] [21:49:58] mutante: do you know if the alarm for scb1001 pdfrender is known/worked upon? [21:50:16] * volans checking what else we missed with icinga-wm on holiday ;) [21:50:59] volans: i know we had an issue on it the other day and pdfrender got limited to use max 2G memory. i don't know anything about new alerts today [21:54:47] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-20_(1.30.0-wmf.6)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3353241 (10Benoit_Rochon) It's very nice, they will do an official launch, open t... [21:56:06] on mw2251 - syslog simply ends and then starts again when i powercycled it [21:56:19] no obvious error right before it shut down [21:56:22] ok for pdfrender, having a look [21:56:28] we had those cases before.. hrm [21:57:13] i'll take releases1001. that's me [21:57:15] is there any HHVM core dump? spikes in the graphs? (check the Prometheus ones, more frequent data ;) ) [21:57:33] for mw2251 that is, ofc [21:58:25] ACKNOWLEDGEMENT - Check systemd state on releases1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
daniel_zahn new install - https://gerrit.wikimedia.org/r/#/c/359227/2 will fix [21:58:25] ACKNOWLEDGEMENT - jenkins_service_running on releases1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war daniel_zahn new install - https://gerrit.wikimedia.org/r/#/c/359227/2 will fix [21:58:26] ACKNOWLEDGEMENT - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[openjdk-7-jdk] daniel_zahn new install - https://gerrit.wikimedia.org/r/#/c/359227/2 will fix [22:01:55] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [22:02:40] !log restarted pdfrender on scb1001 T159922 [22:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:50] T159922: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922 [22:06:34] mutante: pdfrender seems happier after a restart, thanks for releases1001, the last one is restbase2001, might that one might be known, I don't remember [22:07:13] what about "kubestage" [22:07:46] that's alex [22:07:52] still WIP AFAIK [22:08:10] anyway nothing in production there, still AFAIK ;) [22:08:11] sounds good @ pdfrender, cool [22:08:14] ok :) [22:08:22] "stage" sounds like i shouldnt worry anyways [22:08:26] but let's ACK it [22:09:06] ACKNOWLEDGEMENT - Check systemd state on kubestage1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn stage, WIP [22:09:06] ACKNOWLEDGEMENT - Check systemd state on kubestage1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn stage, WIP [22:09:36] ohh that's easy, puppet service [22:09:38] let me fix it [22:09:52] systemctl reset-failed? 
[22:10:09] yeah, sudo systemctl reset-failed puppet [22:10:32] would core dumps be in /tmp? [22:10:35] RECOVERY - Check systemd state on kubestage1002 is OK: OK - running: The system is fully operational [22:10:47] i dont see anything so far [22:10:55] in the graphs [22:11:05] RECOVERY - Check systemd state on kubestage1001 is OK: OK - running: The system is fully operational [22:11:06] it's just a gap [22:11:14] ah :) [22:13:16] let me just fix release1001 too.. i was just going to compile the change that does it [22:18:44] (03PS3) 10Dzahn: jenkins: add stretch support, no more jdk7, use jdk8 [puppet] - 10https://gerrit.wikimedia.org/r/359227 [22:18:59] (03CR) 10Dzahn: [C: 032] "no-op on contint1001/2001 - should fix releases1001" [puppet] - 10https://gerrit.wikimedia.org/r/359227 (owner: 10Dzahn) [22:23:35] RECOVERY - High lag on wdqs2001 is OK: OK: Less than 30.00% above the threshold [600.0] [22:25:17] 10Operations, 10MobileFrontend, 10Reading-Web-Backlog, 10Traffic: Remove disableImages handling from VCL - https://phabricator.wikimedia.org/T168013#3353345 (10Jdlrobson) [22:25:59] (03PS1) 10Dzahn: jenkins: add stretch support pt2, invalid relationship [puppet] - 10https://gerrit.wikimedia.org/r/359356 [22:27:06] (03CR) 10Paladox: [C: 031] jenkins: add stretch support pt2, invalid relationship [puppet] - 10https://gerrit.wikimedia.org/r/359356 (owner: 10Dzahn) [22:27:38] (03CR) 10Dzahn: [C: 032] jenkins: add stretch support pt2, invalid relationship [puppet] - 10https://gerrit.wikimedia.org/r/359356 (owner: 10Dzahn) [22:29:25] paladox: :p .. E: Package 'jenkins' has no installation candidate [22:29:33] aha [22:29:34] hehe, well, yea [22:29:41] mutante you need to upload jenkins to stretch [22:29:42] that is after the former issue was fixed [22:29:45] yep [22:29:51] or use jessie [22:30:07] not sure how you can use jessie without downloading any other packages [22:30:13] only limit it to jenkins. 
[22:30:32] well, the point was to test stretch because it's going to be stable in .. 2 days [22:30:40] will see [22:30:56] ah ok [22:31:08] jenkins isn't in the upstream debian repo. It's in jessie-wikimedia :) [22:31:40] yea, i have uploaded that before [22:31:54] ok [22:31:56] but docs are from precise. will look [22:32:25] ok [22:40:45] PROBLEM - mediawiki-installation DSH group on mw2251 is CRITICAL: Host mw2251 is not in mediawiki-installation dsh group [22:41:53] (03PS2) 10Dzahn: WMCS: add wmcs-admin to labsdb100[13] [puppet] - 10https://gerrit.wikimedia.org/r/359323 (https://phabricator.wikimedia.org/T166310) (owner: 10BryanDavis) [22:45:51] mutante: could mw2251 be repooled? ^^^ (if you pulled the latest code) [22:48:37] (03PS1) 10EBernhardson: [WIP] Add ltr-query 0.1.1 snapshot [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/359359 [22:53:03] !log restarting elasticsearch on relforge to pickup new ltr-query plugin [22:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:37] (03CR) 10BryanDavis: "Suggested by paravoid during review of I01c39f4193c4bc242c0d136a0e5100a066b6f532" [puppet] - 10https://gerrit.wikimedia.org/r/359304 (owner: 10BryanDavis) [22:56:58] !log mw2251 - scap pull [22:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:29] !log mw2251 - repooled [22:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170615T2300). [23:00:04] Jdlrobson and ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:22] \o
[23:02:01] i suppose i can deploy
[23:02:08] (PS2) EBernhardson: [cirrus] Enable crossproject search on all wikipedias (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/357641 (https://phabricator.wikimedia.org/T162276) (owner: DCausse)
[23:02:14] (CR) EBernhardson: [C: 2] [cirrus] Enable crossproject search on all wikipedias (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/357641 (https://phabricator.wikimedia.org/T162276) (owner: DCausse)
[23:02:32] jdlrobson: ping?
[23:03:28] (Merged) jenkins-bot: [cirrus] Enable crossproject search on all wikipedias (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/357641 (https://phabricator.wikimedia.org/T162276) (owner: DCausse)
[23:03:42] (CR) jenkins-bot: [cirrus] Enable crossproject search on all wikipedias (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/357641 (https://phabricator.wikimedia.org/T162276) (owner: DCausse)
[23:05:02] ebernhardson: hey
[23:07:08] jdlrobson: cool, i'll merge yours now too. testing mine atm
[23:08:14] (PS3) EBernhardson: Remove dead config variable MinervaPrintStyles [mediawiki-config] - https://gerrit.wikimedia.org/r/359043 (https://phabricator.wikimedia.org/T166408) (owner: Jdlrobson)
[23:08:38] (CR) EBernhardson: [C: 2] Remove dead config variable MinervaPrintStyles [mediawiki-config] - https://gerrit.wikimedia.org/r/359043 (https://phabricator.wikimedia.org/T166408) (owner: Jdlrobson)
[23:09:42] (Merged) jenkins-bot: Remove dead config variable MinervaPrintStyles [mediawiki-config] - https://gerrit.wikimedia.org/r/359043 (https://phabricator.wikimedia.org/T166408) (owner: Jdlrobson)
[23:09:46] (CR) jenkins-bot: Remove dead config variable MinervaPrintStyles [mediawiki-config] - https://gerrit.wikimedia.org/r/359043 (https://phabricator.wikimedia.org/T166408) (owner: Jdlrobson)
[23:10:44] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Scap: T162276: Enable crossproject search (duration: 00m 51s)
[23:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:55] T162276: Production release: sister project search results display - https://phabricator.wikimedia.org/T162276
[23:11:41] !log ebernhardson@tin Started scap: wmf-config Scap: T162276: Enable crossproject search
[23:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:18] err, i hope thats not a full scap :S
[23:12:25] it's supposed to sync the wmf-config dir
[23:13:04] that looks like a full scap
[23:13:23] scap sync-{file,dir} to sync a dir
[23:13:35] uhhhohh
[23:13:36] (they both go to the same code at this point)
[23:14:09] thcipriani: i remembered the part about how they go to the same code, and for some reason thought they were superseded with the standalone 'sync'
[23:14:31] oh well, it's not the end of the world
[23:14:36] ah, no, that's the eventual plan, but not the current behavior
[23:15:19] !log ebernhardson@tin Finished scap: wmf-config Scap: T162276: Enable crossproject search (duration: 03m 37s)
[23:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:29] it's at least pretty fast these days :)
[23:16:06] yeah, if you're not doing an actual update to the l10n it's quick :)
[23:17:08] jdlrobson: i suppose since you are removing a dead config flag, theres not much to test?
[23:17:20] ebernhardson: exactly :)
[23:17:29] jdlrobson: ok, shipping everywhere
[23:18:45] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: T166408: Remove dead config variable MinervaPrintStyles (duration: 00m 41s)
[23:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:55] T166408: Cleanup and remove MinervaPrintStyles config variable - https://phabricator.wikimedia.org/T166408
[23:19:17] jdlrobson: ^
[23:19:23] thanks ebernhardson !
[23:30:40] !log APT - reprepro copy stretch-wikimedia jessie-wikimedia jenkins (copy existing jenkins package to stretch, it can be used on both)
[23:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:40] RainbowSprinkles: ^ .. and that fixed releases1001.. it now has jenkins :)
[23:32:47] on stretch
[23:33:05] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[23:33:05] RECOVERY - jenkins_service_running on releases1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war
[23:34:39] jdlrobson: still around? could use a quick review for a minor visual glitch
[23:34:44] ebernhardson: sure
[23:34:50] what's the problem?
[23:35:48] jdlrobson: sidebar misbehaving based on width, it drops down below the results "too early" so to speak: https://it.wikisource.org/wiki/Special:Search?search=Ricerca&fulltext=1&searchToken=ag4ehx10tyqkelvnf87eaeg4b
[23:36:11] jdlrobson: i'm thinking a temp fix is to reduce the left-margin from 8% to 7%, which is https://gerrit.wikimedia.org/r/359362
[23:36:16] the sidebar as in the thing on the left or right?
[23:36:17] longer term...i wish we had a grid
[23:36:22] jdlrobson: the right side sister-search results
[23:36:29] what screen resolution?
[23:37:03] it looks fine to me...
[23:37:13] jdlrobson: 1484 width
[23:37:18] !log added stretch support for jenkins (https://gerrit.wikimedia.org/r/#/c/359227/, https://gerrit.wikimedia.org/r/#/c/359356/) | 'reprepro copy stretch-wikimedia jessie-wikimedia jenkins' to make .deb available on stretch | releases1001 now running jenkins , icinga recovered | (hashar) (T164030)
[23:37:26] jdlrobson: fairly normal for people on 1920 wide laptops that have a default zoom enabled
[23:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:37:27] T164030: setup releases1001.eqiad.wmnet (was: setup mwreleases1001) - https://phabricator.wikimedia.org/T164030
[23:38:11] ebernhardson: im seeing it on the right... what seems to be the issue?
[23:38:53] jdlrobson: if you slightly zoom in, or narrow the browser window, it drops down below the search results. This is fine generally, but it does it too early when there is still plenty of room
[23:39:06] it's supposed to drop down below the search results for mobile
[23:39:21] but it's doing it on normal laptops with <1920 resolution
[23:39:28] zooming doesnt make it drop down for me
[23:39:35] it's based on %age so shouldnt
[23:39:43] hmm, it is for me :S
[23:39:51] are you in the office? i can walk over and show you
[23:40:03] nope i'm at home keeping my flu germs away
[23:40:06] :)
[23:40:07] the margin left seems a little arbitrary
[23:40:14] im not sure why that is %age
[23:40:45] RECOVERY - mediawiki-installation DSH group on mw2251 is OK: OK
[23:41:02] jdlrobson: well, the main results are 60%, and the sidebar is 30%, so it used to be 10% margin. But that had similar issue so jan dropped it to 8%
[23:41:12] basically all adding up to 100%
[23:41:44] mm. personally i would hardcode a margin for lower resolutions and bump it up to 10% via media query at a sufficient threshold
[23:42:39] the problem is you're not using border-box
[23:43:32] so even though 60% + 30% + 8% < 100%
[23:43:47] you've got a margin on #bodyContent ul
[23:43:52] of 1.5em
[23:43:53] hmm, with more playing around, it turns out that for my version of chrome, overriding 'margin: 0 0 0 1.5em' on #bodyContent ul makes it all work
[23:45:37] jdlrobson: ahh, ok.
[23:46:14] ebernhardson: it's a bit hacky.. but you could do @media ...{ .mw-searchresults-has-iw #mw-interwiki-results{ margin-right: -1.5em; } } #bodyContent ul {word-wrap: break-word;}
[23:46:39] media.. being "media only screen and (min-width: 720px)"
[23:49:01] jdlrobson: hmm, pulling the margin off the other side doesn't sound too bad, and should at least even it out for now.
[23:51:06] jdlrobson: so like https://gerrit.wikimedia.org/r/359362 ?
[23:56:14] jdlrobson: anyways, thanks! i'll have jan check that out and hopefully we can ship the fix monay
[23:56:17] monday
[23:56:35] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:57:48] (CR) Dzahn: [C: 2] "yep, definitely covered already by the access request we had approved. it said labsdb*" [puppet] - https://gerrit.wikimedia.org/r/359323 (https://phabricator.wikimedia.org/T166310) (owner: BryanDavis)
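Editorial note on the search sidebar discussion above (23:41 to 23:49): the point that 60% + 30% + 8% is under 100% but a fixed 1.5em margin on #bodyContent ul still pushes the row over can be checked with a little arithmetic. The Python below is only a rough sketch of that reasoning, not the actual layout code. The 16px font size, the function names, and the idea that the whole 1.5em is added once to the row are assumptions made for illustration; the real content width at a 1484px window also depends on skin chrome that is not described in the log.

```python
# Rough model of the wrap condition discussed in the log.
# Assumptions (not from the log): 1em == 16px, and the 1.5em list margin
# is added once on top of the percentage-based widths.

def claimed_width_px(container_px, pct_sum=98.0, extra_em=1.5, font_px=16.0):
    """Horizontal space the row asks for: percentage widths plus a fixed margin."""
    return container_px * pct_sum / 100.0 + extra_em * font_px


def sidebar_wraps(container_px, pct_sum=98.0, extra_em=1.5, font_px=16.0):
    """True when the row asks for more than the container has, so the sidebar drops below."""
    return claimed_width_px(container_px, pct_sum, extra_em, font_px) > container_px


def wrap_threshold_px(pct_sum=98.0, extra_em=1.5, font_px=16.0):
    """Container width below which the fixed margin no longer fits in the leftover percentage."""
    return extra_em * font_px * 100.0 / (100.0 - pct_sum)


if __name__ == "__main__":
    # 60% + 30% + 8% margin, as described at 23:41:02
    print("threshold with 8% margin:", wrap_threshold_px(60.0 + 30.0 + 8.0))
    # 7% margin, the temporary fix proposed at 23:36:11
    print("threshold with 7% margin:", wrap_threshold_px(60.0 + 30.0 + 7.0))
    # Cancelling the fixed margin (the margin-right: -1.5em idea) removes the overflow
    print("wraps at 1000px with margin cancelled:", sidebar_wraps(1000.0, extra_em=0.0))
```

Under these assumptions, dropping the percentage margin from 8% to 7% only lowers the width at which the sidebar wraps, while cancelling the fixed 1.5em (the negative margin-right suggestion) removes the overflow at any width, which matches the direction the conversation settles on.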