[00:56:34] PROBLEM - puppet last run on mw1258 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:59:45] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[acme-setup-acme-gerrit]
[01:16:48] (CR) Jdlrobson: Disable RelatedArticles instrumentation on all wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/380554 (https://phabricator.wikimedia.org/T174944) (owner: Jdlrobson)
[01:17:21] (PS2) Jdlrobson: Disable RelatedArticles instrumentation on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/380554 (https://phabricator.wikimedia.org/T174944)
[01:24:44] RECOVERY - puppet last run on mw1258 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[02:23:40] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.19) (duration: 07m 26s)
[02:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:31] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Sep 26 02:29:31 UTC 2017 (duration 5m 52s)
[02:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:48:58] Operations, Edit-Review-Improvements, Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017), Performance: Systematically test load speeds of Watchlist and Recent Changes - https://phabricator.wikimedia.org/T176445#3634196 (jmatazzoni)
[05:09:16] (PS1) BryanDavis: toolforge: Remove /usr/bin/sql [puppet] - https://gerrit.wikimedia.org/r/380685 (https://phabricator.wikimedia.org/T176688)
[05:10:06] (PS1) Marostegui: db-eqiad.php: Removed old db1055 comment [mediawiki-config] - https://gerrit.wikimedia.org/r/380686
[05:10:50] (CR) Marostegui: "> Remember to remove all temporay comments, I think "low load" db1055" [mediawiki-config] - https://gerrit.wikimedia.org/r/379757 (owner: Jcrespo)
[05:12:56] (CR) Marostegui: [C: 2] db-eqiad.php: Removed old db1055 comment [mediawiki-config] - https://gerrit.wikimedia.org/r/380686 (owner: Marostegui)
[05:14:31] (Merged) jenkins-bot: db-eqiad.php: Removed old db1055 comment [mediawiki-config] - https://gerrit.wikimedia.org/r/380686 (owner: Marostegui)
[05:15:29] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove old db1055 comments (duration: 00m 46s)
[05:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:11] (CR) jenkins-bot: db-eqiad.php: Removed old db1055 comment [mediawiki-config] - https://gerrit.wikimedia.org/r/380686 (owner: Marostegui)
[05:24:57] (CR) Marostegui: "So the current status of both hosts: dbstore2001 and dbstore2002 is:" [puppet] - https://gerrit.wikimedia.org/r/374560 (https://phabricator.wikimedia.org/T169516) (owner: Jcrespo)
[05:27:49] (PS2) Giuseppe Lavagetto: Add support for build containers [docker-images/production-images] - https://gerrit.wikimedia.org/r/379792
[05:27:51] (PS2) Giuseppe Lavagetto: Add ruby base image and a fluentd image [docker-images/production-images] - https://gerrit.wikimedia.org/r/379793
[05:35:54] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[05:40:24] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[05:43:07] (PS1) Marostegui: db-codfw.php: Repool db2047 [mediawiki-config] - https://gerrit.wikimedia.org/r/380687 (https://phabricator.wikimedia.org/T176573)
[05:43:44] Operations, ops-codfw, DBA, Patch-For-Review: db2047 got rebooted - https://phabricator.wikimedia.org/T176573#3634327 (Marostegui) The rack looks fine and so do the PDU and their temperature graphs. Going to repool this host and if it happens again we will really need to look into this as the rac...
[05:45:24] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:45:47] (CR) Marostegui: [C: 2] db-codfw.php: Repool db2047 [mediawiki-config] - https://gerrit.wikimedia.org/r/380687 (https://phabricator.wikimedia.org/T176573) (owner: Marostegui)
[05:47:23] (Merged) jenkins-bot: db-codfw.php: Repool db2047 [mediawiki-config] - https://gerrit.wikimedia.org/r/380687 (https://phabricator.wikimedia.org/T176573) (owner: Marostegui)
[05:47:33] (CR) jenkins-bot: db-codfw.php: Repool db2047 [mediawiki-config] - https://gerrit.wikimedia.org/r/380687 (https://phabricator.wikimedia.org/T176573) (owner: Marostegui)
[05:47:55] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:48:10] Operations, ops-codfw, DBA, Patch-For-Review: db2047 got rebooted - https://phabricator.wikimedia.org/T176573#3634330 (Marostegui) Open>Resolved a:Marostegui
[05:48:33] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2047 - T176573 (duration: 00m 44s)
[05:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:38] T176573: db2047 got rebooted - https://phabricator.wikimedia.org/T176573
[06:00:25] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[06:01:04] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[06:04:52] <_joe_> uhm already ended I'd say
[06:05:19] <_joe_> yep
[06:06:36] !log Drop table trackbacks from s5 (only exists on dewiki) - T175051
[06:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:42] T175051: Drop "trackbacks" table on all wikis that have it - https://phabricator.wikimedia.org/T175051
[06:08:34] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:09:04] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:13:14] (CR) ArielGlenn: "How many entities has it been requesting until now? I am guessing fewer?" [puppet] - https://gerrit.wikimedia.org/r/380628 (owner: Hoo man)
[06:15:04] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[06:16:34] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[06:17:04] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 503 (expecting: 200)
[06:18:39] !log Drop table trackbacks from s4 - T175051
[06:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:45] T175051: Drop "trackbacks" table on all wikis that have it - https://phabricator.wikimedia.org/T175051
[06:19:04] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[06:28:04] PROBLEM - graphite.wikimedia.org on graphite1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time
[06:28:32] !log mobrovac@tin Started deploy [cpjobqueue/deploy@17c3833]: (no justification provided)
[06:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:01] !log mobrovac@tin Finished deploy [cpjobqueue/deploy@17c3833]: (no justification provided) (duration: 00m 28s)
[06:29:04] RECOVERY - graphite.wikimedia.org on graphite1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1661 bytes in 0.033 second response time
[06:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:08] Operations, Epic, Goal, Services (doing), and 2 others: End of September milestone: Migrate first production use case - https://phabricator.wikimedia.org/T175637#3634427 (mobrovac)
[06:40:28] (Draft2) Jayprakash12345: Add autopatrolled user group to dty.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/380689 (https://phabricator.wikimedia.org/T176709)
[06:40:30] (CR) jerkins-bot: [V: -1] Add autopatrolled user group to dty.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/380689 (https://phabricator.wikimedia.org/T176709) (owner: Jayprakash12345)
[06:45:04] (PS3) Jayprakash12345: Add autopatrolled user group to dty.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/380689 (https://phabricator.wikimedia.org/T176709)
[06:45:14] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[06:45:55] <_joe_> !log restarting varnish backend on cp3033, throwing 503s
[06:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:14] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[06:48:35] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[06:53:35] (PS20) Phedenskog: Make values stackable [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902)
[06:54:14] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:54:53] (CR) Jcrespo: "+1 to all Manuel says" [puppet] - https://gerrit.wikimedia.org/r/374560 (https://phabricator.wikimedia.org/T169516) (owner: Jcrespo)
[06:55:44] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:14:43] (CR) Gehel: "Section already added to the docs: https://wikitech.wikimedia.org/wiki/Search#Cold_restart" [puppet] - https://gerrit.wikimedia.org/r/380524 (https://phabricator.wikimedia.org/T176409) (owner: Gehel)
[07:14:51] (PS2) Gehel: elasticsearch: only wait 5 minutes for all nodes in case of cold restart [puppet] - https://gerrit.wikimedia.org/r/380524 (https://phabricator.wikimedia.org/T176409)
[07:15:58] (CR) Gehel: [C: 2] elasticsearch: only wait 5 minutes for all nodes in case of cold restart [puppet] - https://gerrit.wikimedia.org/r/380524 (https://phabricator.wikimedia.org/T176409) (owner: Gehel)
[07:16:31] !log installing perl security updates
[07:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:49] Operations, Discovery, Elasticsearch, Wikimedia-Logstash, and 3 others: api feature logs should be sent to both eqiad and codfw clusters - https://phabricator.wikimedia.org/T176430#3634475 (dcausse) @Gehel the indices are properly created and have data in them, switching to codfw for api-feature-...
[07:27:44] (PS21) Phedenskog: Make values stackable [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902)
[07:28:14] !log Drop trackbacks table from s6 - T175051
[07:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:19] T175051: Drop "trackbacks" table on all wikis that have it - https://phabricator.wikimedia.org/T175051
[07:28:23] (CR) Mattflaschen: [C: 2] RCFilters: cleanup unused variable [mediawiki-config] - https://gerrit.wikimedia.org/r/379719 (owner: Sbisson)
[07:29:29] (PS22) Phedenskog: Make values stackable [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902)
[07:31:20] (Merged) jenkins-bot: RCFilters: cleanup unused variable [mediawiki-config] - https://gerrit.wikimedia.org/r/379719 (owner: Sbisson)
[07:31:34] (CR) jenkins-bot: RCFilters: cleanup unused variable [mediawiki-config] - https://gerrit.wikimedia.org/r/379719 (owner: Sbisson)
[07:31:46] (CR) Phedenskog: "Krinkle: Updated the raw data (I took a couple of rows), it works, but I need to manually check the output that it looks ok. I'll do that " [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: Phedenskog)
[07:32:40] (PS1) Muehlenhoff: Remove a stray import of salt.client [debs/debdeploy] - https://gerrit.wikimedia.org/r/380691
[07:36:09] Operations, ops-eqiad, Patch-For-Review, User-Elukey, User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3634518 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['mw1327.eqiad.wmnet', 'mw1328.eqiad.wmnet...
[07:36:42] (CR) Muehlenhoff: [C: 2] Remove a stray import of salt.client [debs/debdeploy] - https://gerrit.wikimedia.org/r/380691 (owner: Muehlenhoff)
[07:37:38] (Draft2) Jayprakash12345: Restrict local uploads on dty.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/380690 (https://phabricator.wikimedia.org/T176706)
[07:44:45] (PS7) DCausse: Upgrade plugins to elastic 5.5.2 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/376477 (https://phabricator.wikimedia.org/T175159)
[07:46:11] !log Drop trackbacks table from s7 - T175051
[07:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:16] T175051: Drop "trackbacks" table on all wikis that have it - https://phabricator.wikimedia.org/T175051
[07:47:14] (CR) DCausse: "yes, should be good to merge, we can even deploy it on the experimental repo with elastic 5.5.2, vagrant should then start to pick up elas" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/376477 (https://phabricator.wikimedia.org/T175159) (owner: DCausse)
[07:47:58] (PS1) Gehel: logstash: cleanup old api feature usage indices for all clusters [puppet] - https://gerrit.wikimedia.org/r/380692 (https://phabricator.wikimedia.org/T176430)
[07:48:46] !log mattflaschen@tin Synchronized wmf-config/CommonSettings-labs.php: RCFilters: Beta Cluster only (duration: 00m 46s)
[07:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:31] (CR) Gehel: "Puppet compiler looks happy: https://puppet-compiler.wmflabs.org/compiler02/8018/" [puppet] - https://gerrit.wikimedia.org/r/380692 (https://phabricator.wikimedia.org/T176430) (owner: Gehel)
[07:58:55] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 2, dormant: 0, excluded: 0, unused: 0
[07:59:15] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0
[07:59:38] (CR) Gehel: [C: 2] "LGTM" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/376477 (https://phabricator.wikimedia.org/T175159) (owner: DCausse)
[07:59:44] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0
[07:59:51] (PS1) Elukey: hieradata::regex: allow notifications for mw132[0-8] [puppet] - https://gerrit.wikimedia.org/r/380693 (https://phabricator.wikimedia.org/T165519)
[08:00:35] (CR) Elukey: [C: 2] hieradata::regex: allow notifications for mw132[0-8] [puppet] - https://gerrit.wikimedia.org/r/380693 (https://phabricator.wikimedia.org/T165519) (owner: Elukey)
[08:01:11] (PS4) Filippo Giunchedi: Thumbor: expose Nginx request time [puppet] - https://gerrit.wikimedia.org/r/380483 (https://phabricator.wikimedia.org/T161535) (owner: Gilles)
[08:06:27] (CR) Gehel: [V: 2 C: 2] Upgrade plugins to elastic 5.5.2 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/376477 (https://phabricator.wikimedia.org/T175159) (owner: DCausse)
[08:07:03] (CR) DCausse: logstash: cleanup old api feature usage indices for all clusters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/380692 (https://phabricator.wikimedia.org/T176430) (owner: Gehel)
[08:07:03] !log Drop table trackbacks from s3 - T175051
[08:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:09] T175051: Drop "trackbacks" table on all wikis that have it - https://phabricator.wikimedia.org/T175051
[08:07:50] (CR) Muehlenhoff: [C: -1] "The parts related to shell access are fine, but you need to also drop him from ldap_only_users" [puppet] - https://gerrit.wikimedia.org/r/380565 (https://phabricator.wikimedia.org/T176529) (owner: Dzahn)
[08:08:57] !log Drop old temporary tar.gz files from dbstore1001 to get back some disk space
[08:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:53] !log roll-restart swift-proxy to apply https://gerrit.wikimedia.org/r/#/c/380483/ - T161535
[08:09:56] (PS2) Gehel: logstash: cleanup old api feature usage indices for all clusters [puppet] - https://gerrit.wikimedia.org/r/380692 (https://phabricator.wikimedia.org/T176430)
[08:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:58] T161535: Track nginx request time in Thumbor debug headers - https://phabricator.wikimedia.org/T161535
[08:10:30] (CR) Gehel: logstash: cleanup old api feature usage indices for all clusters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/380692 (https://phabricator.wikimedia.org/T176430) (owner: Gehel)
[08:14:00] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[08:14:20] PROBLEM - HHVM rendering on mw1327 is CRITICAL: connect to address 10.64.32.48 and port 80: Connection refused
[08:14:20] PROBLEM - puppet last run on mw1327 is CRITICAL: Return code of 255 is out of bounds
[08:15:20] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[08:15:40] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0
[08:16:20] PROBLEM - Apache HTTP on mw1327 is CRITICAL: connect to address 10.64.32.48 and port 80: Connection refused
[08:16:20] PROBLEM - MD RAID on mw1327 is CRITICAL: Return code of 255 is out of bounds
[08:16:22] (CR) Filippo Giunchedi: [C: 2] Upgrade to 1.5 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/379218 (https://phabricator.wikimedia.org/T173804) (owner: Gilles)
[08:17:30] PROBLEM - Nginx local proxy to apache on mw1327 is CRITICAL: connect to address 10.64.32.48 and port 443: Connection refused
[08:17:30] PROBLEM - Check size of conntrack table on mw1327 is CRITICAL: Return code of 255 is out of bounds
[08:18:05] Operations, Performance-Team, Thumbor, Patch-For-Review: Nginx timeouts on Thumbor - https://phabricator.wikimedia.org/T150746#3634599 (Gilles)
[08:18:30] PROBLEM - Check systemd state on mw1327 is CRITICAL: Return code of 255 is out of bounds
[08:19:31] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1327 is CRITICAL: Return code of 255 is out of bounds
[08:19:31] PROBLEM - configured eth on mw1327 is CRITICAL: Return code of 255 is out of bounds
[08:20:30] PROBLEM - Check whether ferm is active by checking the default input chain on mw1327 is CRITICAL: Return code of 255 is out of bounds
[08:20:30] PROBLEM - dhclient process on mw1327 is CRITICAL: Return code of 255 is out of bounds
[08:21:30] PROBLEM - DPKG on mw1327 is CRITICAL: Return code of 255 is out of bounds
[08:21:30] PROBLEM - mediawiki-installation DSH group on mw1327 is CRITICAL: Host mw1327 is not in mediawiki-installation dsh group
[08:22:12] this is me
[08:22:30] PROBLEM - nutcracker port on mw1327 is CRITICAL: Return code of 255 is out of bounds
[08:22:30] PROBLEM - Disk space on mw1327 is CRITICAL: Return code of 255 is out of bounds
[08:23:14] I removed the disabled notifications for mw132[0-8] while reimaging 7/8
[08:23:30] PROBLEM - HHVM processes on mw1327 is CRITICAL: Return code of 255 is out of bounds
[08:23:30] PROBLEM - nutcracker process on mw1327 is CRITICAL: Return code of 255 is out of bounds
[08:24:25] !log roll-restart thumbor to upgrade to 1.5 - T173804
[08:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:30] T173804: Rotation parameters from EXIF fail in Mediawiki - https://phabricator.wikimedia.org/T173804
[08:30:08] (PS2) Filippo Giunchedi: swift: don't track connections to swift backend services on frontend machines [puppet] - https://gerrit.wikimedia.org/r/374170 (https://phabricator.wikimedia.org/T173731)
[08:31:55] (CR) Volans: "What about autodiscovery for the SHARDS? Something like:" [puppet] - https://gerrit.wikimedia.org/r/374560 (https://phabricator.wikimedia.org/T169516) (owner: Jcrespo)
[08:32:20] (PS1) Giuseppe Lavagetto: Add proxy support to the build process itself [docker-images/production-images] - https://gerrit.wikimedia.org/r/380696
[08:32:22] (PS1) Giuseppe Lavagetto: Add image for running fluentd as a daemonset in kubernetes [docker-images/production-images] - https://gerrit.wikimedia.org/r/380697
[08:32:31] (CR) Filippo Giunchedi: "> That looks fine, but I would prefer if we could use the opportunity" [puppet] - https://gerrit.wikimedia.org/r/374170 (https://phabricator.wikimedia.org/T173731) (owner: Filippo Giunchedi)
[08:32:51] (CR) DCausse: [C: 1] logstash: cleanup old api feature usage indices for all clusters [puppet] - https://gerrit.wikimedia.org/r/380692 (https://phabricator.wikimedia.org/T176430) (owner: Gehel)
[08:33:15] (PS3) Gehel: logstash: cleanup old api feature usage indices for all clusters [puppet] - https://gerrit.wikimedia.org/r/380692 (https://phabricator.wikimedia.org/T176430)
[08:33:27] (PS3) Filippo Giunchedi: swift: don't track connections to swift backend services on frontend machines [puppet] - https://gerrit.wikimedia.org/r/374170 (https://phabricator.wikimedia.org/T173731)
[08:33:31] (CR) Volans: [C: 2] OpenStack: limit grammar to not overlap the global one [software/cumin] - https://gerrit.wikimedia.org/r/380653 (owner: Volans)
[08:33:53] (CR) Gehel: [C: 2] logstash: cleanup old api feature usage indices for all clusters [puppet] - https://gerrit.wikimedia.org/r/380692 (https://phabricator.wikimedia.org/T176430) (owner: Gehel)
[08:34:09] (CR) Filippo Giunchedi: [C: 2] swift: don't track connections to swift backend services on frontend machines [puppet] - https://gerrit.wikimedia.org/r/374170 (https://phabricator.wikimedia.org/T173731) (owner: Filippo Giunchedi)
[08:34:48] (PS4) Filippo Giunchedi: swift: don't track connections to swift backend services on frontend machines [puppet] - https://gerrit.wikimedia.org/r/374170 (https://phabricator.wikimedia.org/T173731)
[08:35:51] (Merged) jenkins-bot: OpenStack: limit grammar to not overlap the global one [software/cumin] - https://gerrit.wikimedia.org/r/380653 (owner: Volans)
[08:37:22] (CR) Jcrespo: "Volans, thanks- that is actually useful to make things work! I would do it inversely- consult remotely and try to backup remotely if possi" [puppet] - https://gerrit.wikimedia.org/r/374560 (https://phabricator.wikimedia.org/T169516) (owner: Jcrespo)
[08:38:11] (CR) Marostegui: "> What about autodiscovery for the SHARDS? Something like:" [puppet] - https://gerrit.wikimedia.org/r/374560 (https://phabricator.wikimedia.org/T169516) (owner: Jcrespo)
[08:39:15] (PS1) Volans: CHANGELOG: add changelogs for release v1.1.1. [software/cumin] - https://gerrit.wikimedia.org/r/380698
[08:42:48] (CR) Volans: [C: 2] CHANGELOG: add changelogs for release v1.1.1. [software/cumin] - https://gerrit.wikimedia.org/r/380698 (owner: Volans)
[08:44:43] !log run puppet on ms-fe* to reduce conntrack generated by swift internal clients towards backends - T173731
[08:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:48] T173731: Reduce swift frontend conntrack usage - https://phabricator.wikimedia.org/T173731
[08:44:55] PROBLEM - HHVM rendering on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:45:06] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v1.1.1. [software/cumin] - https://gerrit.wikimedia.org/r/380698 (owner: Volans)
[08:46:36] Operations, Ops-Access-Requests, Research: Server access for Miriam Redi - https://phabricator.wikimedia.org/T176682#3634723 (Miriam) Document Signed.
[08:46:48] !log installing apache updates on mw* app servers
[08:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:55] PROBLEM - Apache HTTP on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:47:25] PROBLEM - Nginx local proxy to apache on mw2145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:47:55] PROBLEM - Nginx local proxy to apache on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:48:15] RECOVERY - Nginx local proxy to apache on mw2145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.196 second response time
[08:50:07] (PS2) Muehlenhoff: Remove some salt references [puppet] - https://gerrit.wikimedia.org/r/380500
[08:50:53] (CR) Muehlenhoff: [C: 2] Remove some salt references [puppet] - https://gerrit.wikimedia.org/r/380500 (owner: Muehlenhoff)
[08:51:46] PROBLEM - mediawiki-installation DSH group on mw1328 is CRITICAL: Host mw1328 is not in mediawiki-installation dsh group
[08:52:24] (CR) Jcrespo: "I would do /opt/wmf-mariadb101/bin/mysqladmin ping -h -P first, then check localhost (as manuel suggests), then error" [puppet] - https://gerrit.wikimedia.org/r/374560 (https://phabricator.wikimedia.org/T169516) (owner: Jcrespo)
[08:53:35] RECOVERY - Disk space on copper is OK: DISK OK
[08:54:02] (PS1) Hashar: package_builder: always export $BACKPORTS [puppet] - https://gerrit.wikimedia.org/r/380704 (https://phabricator.wikimedia.org/T173999)
[08:54:45] (CR) Marostegui: "> I would do /opt/wmf-mariadb101/bin/mysqladmin ping -h " [puppet] - https://gerrit.wikimedia.org/r/374560 (https://phabricator.wikimedia.org/T169516) (owner: Jcrespo)
[08:57:13] (CR) Hashar: "recheck" [software/cumin] (debian) - https://gerrit.wikimedia.org/r/379638 (owner: Volans)
[09:01:27] (PS1) Gehel: elasticsearch: cleanup curator config dir [puppet] - https://gerrit.wikimedia.org/r/380705
[09:04:02] (CR) DCausse: [C: 1] elasticsearch: cleanup curator config dir [puppet] - https://gerrit.wikimedia.org/r/380705 (owner: Gehel)
[09:04:21] (CR) Gehel: [C: 2] elasticsearch: cleanup curator config dir [puppet] - https://gerrit.wikimedia.org/r/380705 (owner: Gehel)
[09:05:13] (CR) Hashar: "WIP. I also added in the sudo rules for jenkins-deploy:" [puppet] - https://gerrit.wikimedia.org/r/380704 (https://phabricator.wikimedia.org/T173999) (owner: Hashar)
[09:05:32] (PS1) Volans: Upstream release 1.1.1 [software/cumin] (debian) - https://gerrit.wikimedia.org/r/380706
[09:05:45] (PS2) Muehlenhoff: Remove salt-key from dc-ops privileges [puppet] - https://gerrit.wikimedia.org/r/380475
[09:05:55] PROBLEM - Check systemd state on copper is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:06:18] <_joe_> yeah that's me
[09:06:33] (CR) Muehlenhoff: [C: 2] Remove salt-key from dc-ops privileges [puppet] - https://gerrit.wikimedia.org/r/380475 (owner: Muehlenhoff)
[09:07:58] Operations, media-storage, Patch-For-Review, User-fgiunchedi: Reduce swift frontend conntrack usage - https://phabricator.wikimedia.org/T173731#3634768 (fgiunchedi) Open>Resolved We're now explicitly excluding statsite traffic and swift clients running on the proxy that talk to backend sw...
[09:10:13] <_joe_> !log rebooting copper to verify docker setup is correct
[09:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:53] (PS3) Ema: LVS: turn off ip_early_demux [puppet] - https://gerrit.wikimedia.org/r/379798 (owner: BBlack)
[09:11:59] (CR) Ema: [V: 2 C: 2] LVS: turn off ip_early_demux [puppet] - https://gerrit.wikimedia.org/r/379798 (owner: BBlack)
[09:12:06] PROBLEM - puppet last run on labvirt1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:12:15] PROBLEM - puppet last run on ms-be1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:12:16] PROBLEM - puppet last run on ganeti1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:12:25] PROBLEM - puppet last run on db1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:12:25] PROBLEM - puppet last run on mc1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:12:26] PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:12:35] PROBLEM - puppet last run on poolcounter1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:12:35] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:12:36] PROBLEM - puppet last run on wtp1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:12:36] PROBLEM - puppet last run on wtp1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:12:36] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:12:36] PROBLEM - puppet last run on cerium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:12:37] !log add mw132[4,5,6] to live traffic (new appservers) - weights will be increased incrementally from 5 to 20
[09:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:56] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:13:03] nitrogen 502s
[09:13:05] PROBLEM - puppet last run on db1097 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:13:05] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:13:15] PROBLEM - puppet last run on db1099 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:13:17] <_joe_> elukey: yeah, can you look if it restarted?
[09:13:26] doing it now
[09:13:26] <_joe_> I'm working on smth else atm
[09:13:29] <_joe_> thanks
[09:13:47] puppetdb: Active: active (running) since Tue 2017-09-26 09:09:49 UTC; 3min 45s ago
[09:14:35] RECOVERY - puppet last run on wtp1037 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[09:14:55] PROBLEM - puppet last run on copper is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[docker]
[09:14:55] PROBLEM - Check systemd state on copper is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:15:05] RECOVERY - puppet last run on db1097 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[09:15:16] PROBLEM - puppet last run on mw1327 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:15:22] (PS1) Gehel: logstash: remove cleanup code [puppet] - https://gerrit.wikimedia.org/r/380707
[09:15:25] RECOVERY - Apache HTTP on mw1327 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.001 second response time
[09:15:25] PROBLEM - Check whether ferm is active by checking the default input chain on mw1327 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:15:25] PROBLEM - Check systemd state on mw1327 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:15:26] PROBLEM - MD RAID on mw1327 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:15:26] PROBLEM - nutcracker port on mw1327 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:15:26] PROBLEM - Disk space on mw1327 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:15:26] PROBLEM - DPKG on mw1327 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:15:27] RECOVERY - Nginx local proxy to apache on mw1327 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.007 second response time [09:15:27] PROBLEM - Check size of conntrack table on mw1327 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:15:28] PROBLEM - configured eth on mw1327 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:15:28] PROBLEM - nutcracker process on mw1327 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:15:29] PROBLEM - HHVM processes on mw1327 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:15:29] PROBLEM - dhclient process on mw1327 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:15:46] sigh, those are mine [09:17:15] RECOVERY - Apache HTTP on mw1328 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 1.022 second response time [09:17:15] RECOVERY - Nginx local proxy to apache on mw1328 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 3.058 second response time [09:17:29] downtimed [09:20:30] (03Abandoned) 10Hashar: package_builder: always export $BACKPORTS [puppet] - 10https://gerrit.wikimedia.org/r/380704 (https://phabricator.wikimedia.org/T173999) (owner: 10Hashar) [09:20:55] RECOVERY - Check systemd state on copper is OK: OK - running: The system is fully operational [09:21:57] (03CR) 10Elukey: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler02/8021/mw1280.eqiad.wmnet/change.mw1280.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/380472 (owner: 10Elukey) [09:24:25] RECOVERY - Check whether ferm is active by checking the default input chain on mw1327 is OK: OK ferm input default policy is set [09:24:25] RECOVERY - Check systemd state on mw1327 is OK: OK - running: The system is fully operational [09:24:25] RECOVERY - Disk space on mw1327 is OK: DISK OK [09:24:25] RECOVERY - Check size of conntrack table on mw1327 is OK: OK: nf_conntrack is 0 % full [09:24:25] RECOVERY - configured eth on mw1327 is OK: OK - 
interfaces up [09:24:26] RECOVERY - MD RAID on mw1327 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [09:24:26] RECOVERY - HHVM processes on mw1327 is OK: PROCS OK: 6 processes with command name hhvm [09:24:27] RECOVERY - dhclient process on mw1327 is OK: PROCS OK: 0 processes with command name dhclient [09:25:25] RECOVERY - DPKG on mw1327 is OK: All packages OK [09:25:53] (03PS2) 10Elukey: mediawiki::web::modules: force dependency between apache confs [puppet] - 10https://gerrit.wikimedia.org/r/380472 [09:26:34] (03CR) 10Hashar: "recheck" [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/380706 (owner: 10Volans) [09:31:28] (03PS4) 10Filippo Giunchedi: base: ability to send syslog over TLS [puppet] - 10https://gerrit.wikimedia.org/r/378922 (https://phabricator.wikimedia.org/T136312) [09:31:43] (03CR) 10jerkins-bot: [V: 04-1] base: ability to send syslog over TLS [puppet] - 10https://gerrit.wikimedia.org/r/378922 (https://phabricator.wikimedia.org/T136312) (owner: 10Filippo Giunchedi) [09:33:09] (03CR) 10Muehlenhoff: "That would work, but I'm wondering whether a "require_package('apache2')" wouldn't be cleaner (which also ensures that the installation of" [puppet] - 10https://gerrit.wikimedia.org/r/380472 (owner: 10Elukey) [09:36:48] (03CR) 10Muehlenhoff: nutcracker: create the service only after the package install (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/380487 (owner: 10Elukey) [09:39:45] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/b1020eed7c3c0362670c028fe2e5036bb9b7228599c0142c360541b02613157f/shm is not accessible: Permission denied [09:40:15] RECOVERY - puppet last run on labvirt1017 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [09:40:25] PROBLEM - HHVM rendering on mw2127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:40:26] RECOVERY - nutcracker port on mw1327 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 
[09:40:35] RECOVERY - nutcracker process on mw1327 is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker [09:40:35] RECOVERY - puppet last run on db1043 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [09:40:35] RECOVERY - puppet last run on ms-be1034 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [09:40:35] RECOVERY - puppet last run on mc1031 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [09:40:35] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [09:40:45] RECOVERY - puppet last run on poolcounter1001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [09:40:45] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [09:40:45] RECOVERY - puppet last run on wtp1048 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [09:40:46] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:41:06] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [09:41:15] RECOVERY - puppet last run on snapshot1006 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [09:41:16] RECOVERY - HHVM rendering on mw2127 is OK: HTTP OK: HTTP/1.1 200 OK - 76204 bytes in 0.302 second response time [09:41:26] RECOVERY - puppet last run on ganeti1001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [09:41:50] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [09:42:20] RECOVERY - HHVM rendering on mw1327 is OK: HTTP OK: HTTP/1.1 200 OK - 76148 bytes in 0.205 second response time [09:42:21] RECOVERY - puppet last run on db1099 is 
OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [09:46:27] (03PS5) 10Filippo Giunchedi: base: ability to send syslog over TLS [puppet] - 10https://gerrit.wikimedia.org/r/378922 (https://phabricator.wikimedia.org/T136312) [09:46:42] (03CR) 10jerkins-bot: [V: 04-1] base: ability to send syslog over TLS [puppet] - 10https://gerrit.wikimedia.org/r/378922 (https://phabricator.wikimedia.org/T136312) (owner: 10Filippo Giunchedi) [09:48:52] (03PS6) 10Filippo Giunchedi: base: ability to send syslog over TLS [puppet] - 10https://gerrit.wikimedia.org/r/378922 (https://phabricator.wikimedia.org/T136312) [09:49:30] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1327 is OK: OK: synced at Tue 2017-09-26 09:49:23 UTC. [09:50:40] RECOVERY - puppet last run on mw1327 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [09:51:30] RECOVERY - HHVM rendering on mw1328 is OK: HTTP OK: HTTP/1.1 200 OK - 76148 bytes in 0.128 second response time [09:51:32] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3634816 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1327.eqiad.wmnet', 'mw1328.eqiad.wmnet'] ``` and were **ALL** successful. [09:54:47] \o/ [09:59:10] (03PS1) 10Hashar: Run tests with the upstream version [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/380711 [09:59:31] (03CR) 10Mark Bergsma: [C: 031] bgp: bugfixes in MPUnreachNLRI attribute construction [debs/pybal] - 10https://gerrit.wikimedia.org/r/379973 (owner: 10Ema) [09:59:45] (03PS1) 10Muehlenhoff: mediawiki::packages::fonts: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/380712 [09:59:51] RECOVERY - Disk space on copper is OK: DISK OK [10:04:52] (03CR) 10Mark Bergsma: [C: 04-1] "Found a typo in the test class name." 
(031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/379973 (owner: 10Ema) [10:08:00] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 [10:08:14] (03PS7) 10Filippo Giunchedi: base: ability to send syslog over TLS [puppet] - 10https://gerrit.wikimedia.org/r/378922 (https://phabricator.wikimedia.org/T136312) [10:16:30] (03PS1) 10Volans: sshd: skip LDAP lookup for the root user [puppet] - 10https://gerrit.wikimedia.org/r/380717 (https://phabricator.wikimedia.org/T176609) [10:17:04] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [10:18:09] (03PS8) 10Filippo Giunchedi: base: ability to send syslog over TLS [puppet] - 10https://gerrit.wikimedia.org/r/378922 (https://phabricator.wikimedia.org/T136312) [10:19:35] (03PS2) 10Ema: bgp: bugfixes in MPUnreachNLRI attribute construction [debs/pybal] - 10https://gerrit.wikimedia.org/r/379973 [10:20:23] PROBLEM - Check systemd state on copper is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:20:52] <_joe_> !log rebooting copper (again) [10:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:13] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 [10:24:13] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [10:25:22] (03CR) 10Mark Bergsma: [C: 031] bgp: bugfixes in MPUnreachNLRI attribute construction [debs/pybal] - 10https://gerrit.wikimedia.org/r/379973 (owner: 10Ema) [10:25:24] PROBLEM - Check systemd state on copper is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:25:33] PROBLEM - puppet last run on copper is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[docker] [10:25:44] (03CR) 10Volans: "minor comment" (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/379973 (owner: 10Ema) [10:26:18] !log joal@tin Started deploy [analytics/refinery@5d806c6]: Regular analytics deploy [10:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:17] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/380717 (https://phabricator.wikimedia.org/T176609) (owner: 10Volans) [10:32:11] (03PS2) 10Volans: Upstream release 1.1.1 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/380706 [10:34:18] !log joal@tin Finished deploy [analytics/refinery@5d806c6]: Regular analytics deploy (duration: 08m 00s) [10:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:53] (03CR) 10Daniel Kinzler: [C: 031] "After some more RFC reading and head scratching, I come to the conclusion that all available options are bad :)" [puppet] - 10https://gerrit.wikimedia.org/r/360887 (https://phabricator.wikimedia.org/T163922) (owner: 10Ladsgroup) [10:39:18] (03CR) 10Volans: [C: 032] Upstream release 1.1.1 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/380706 (owner: 10Volans) [10:40:26] (03CR) 10Hashar: [C: 032] Upstream release 1.1.1 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/380706 (owner: 10Volans) [10:40:42] (03Abandoned) 10Hashar: Run tests with the upstream version [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/380711 (owner: 10Hashar) [10:43:33] RECOVERY - Check systemd state on copper is OK: OK - running: The system is fully operational [10:45:46] (03Merged) 10jenkins-bot: Upstream release 1.1.1 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/380706 (owner: 10Volans) [10:48:13] PROBLEM - Disk space on 
copper is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/cf615e41106e7874aecba0ca0719f8681e0f456096ed2b2d26c89d7eef77556c/shm is not accessible: Permission denied [10:50:13] RECOVERY - Disk space on copper is OK: DISK OK [10:52:36] !log uploaded cumin_1.1.1-1_amd64.deb to apt.wikimedia.org jessie-wikimedia [10:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:02] \o/ [10:58:05] (03PS1) 10ArielGlenn: Move daaset nfs server manifests to dump module [puppet] - 10https://gerrit.wikimedia.org/r/380721 (https://phabricator.wikimedia.org/T175528) [10:58:30] (03CR) 10jerkins-bot: [V: 04-1] Move daaset nfs server manifests to dump module [puppet] - 10https://gerrit.wikimedia.org/r/380721 (https://phabricator.wikimedia.org/T175528) (owner: 10ArielGlenn) [11:00:38] (03CR) 10Dereckson: "We've a consensus to discard the 303." [puppet] - 10https://gerrit.wikimedia.org/r/360887 (https://phabricator.wikimedia.org/T163922) (owner: 10Ladsgroup) [11:06:05] (03CR) 10Dereckson: [C: 04-1] "No EDP for dty.wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380690 (https://phabricator.wikimedia.org/T176706) (owner: 10Jayprakash12345) [11:07:18] (03CR) 10Daniel Kinzler: [C: 031] "@Derekson It's not temporary. So in practical terms, 301 is probably best." [puppet] - 10https://gerrit.wikimedia.org/r/360887 (https://phabricator.wikimedia.org/T163922) (owner: 10Ladsgroup) [11:07:22] (03CR) 10MarcoAurelio: [C: 04-1] Add autopatrolled user group to dty.wikipedia (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380689 (https://phabricator.wikimedia.org/T176709) (owner: 10Jayprakash12345) [11:08:43] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Well, 302 is actually a mess in its own. The "Moved Temporarily" description was the one in RFC1945 (HTTP/1.0). 
It was then implemented by" [puppet] - 10https://gerrit.wikimedia.org/r/360887 (https://phabricator.wikimedia.org/T163922) (owner: 10Ladsgroup) [11:09:18] (03PS2) 10ArielGlenn: Move daaset nfs server manifests to dump module [puppet] - 10https://gerrit.wikimedia.org/r/380721 (https://phabricator.wikimedia.org/T175528) [11:09:42] (03CR) 10jerkins-bot: [V: 04-1] Move daaset nfs server manifests to dump module [puppet] - 10https://gerrit.wikimedia.org/r/380721 (https://phabricator.wikimedia.org/T175528) (owner: 10ArielGlenn) [11:10:42] (03CR) 10Alexandros Kosiaris: [C: 04-1] "And of course rfc 7231 altered 302 once more. Now the cacheable stuff is not even mentioned. Anyway I think we have gone down the rabbitho" [puppet] - 10https://gerrit.wikimedia.org/r/360887 (https://phabricator.wikimedia.org/T163922) (owner: 10Ladsgroup) [11:11:46] (03PS4) 10Gilles: Expose Thumbor swift username [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376043 (https://phabricator.wikimedia.org/T144479) [11:12:57] (03PS1) 10Muehlenhoff: Add support for stretch to hhvm::debug [puppet] - 10https://gerrit.wikimedia.org/r/380722 [11:13:18] (03CR) 10jerkins-bot: [V: 04-1] Add support for stretch to hhvm::debug [puppet] - 10https://gerrit.wikimedia.org/r/380722 (owner: 10Muehlenhoff) [11:14:27] (03PS4) 10Alexandros Kosiaris: Add /data/ Redirect for commons [puppet] - 10https://gerrit.wikimedia.org/r/360887 (https://phabricator.wikimedia.org/T163922) (owner: 10Ladsgroup) [11:14:54] (03PS3) 10ArielGlenn: Move daaset nfs server manifests to dump module [puppet] - 10https://gerrit.wikimedia.org/r/380721 (https://phabricator.wikimedia.org/T175528) [11:15:14] (03CR) 10Alexandros Kosiaris: [C: 031] "Looks fine to me, gonna leave a few hours for anyone to object and then merge and shepherd into production." 
[puppet] - 10https://gerrit.wikimedia.org/r/360887 (https://phabricator.wikimedia.org/T163922) (owner: 10Ladsgroup) [11:15:19] (03CR) 10jerkins-bot: [V: 04-1] Move daaset nfs server manifests to dump module [puppet] - 10https://gerrit.wikimedia.org/r/380721 (https://phabricator.wikimedia.org/T175528) (owner: 10ArielGlenn) [11:16:33] (03PS4) 10Jayprakash12345: Add autopatrolled user group to dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380689 (https://phabricator.wikimedia.org/T176709) [11:17:50] (03CR) 10Daniel Kinzler: [C: 031] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/360887 (https://phabricator.wikimedia.org/T163922) (owner: 10Ladsgroup) [11:18:15] (03PS4) 10ArielGlenn: Move daaset nfs server manifests to dump module [puppet] - 10https://gerrit.wikimedia.org/r/380721 (https://phabricator.wikimedia.org/T175528) [11:18:55] (03PS2) 10Muehlenhoff: Add support for stretch to hhvm::debug [puppet] - 10https://gerrit.wikimedia.org/r/380722 [11:19:41] (03CR) 10Gilles: [C: 031] webperf: Fix crash when event contains browser_major:null [puppet] - 10https://gerrit.wikimedia.org/r/379820 (https://phabricator.wikimedia.org/T176149) (owner: 10Krinkle) [11:20:50] (03CR) 10Gilles: [C: 031] webperf: Add navtiming tests to puppet.git:/tox.ini [puppet] - 10https://gerrit.wikimedia.org/r/379830 (owner: 10Krinkle) [11:25:03] (03Abandoned) 10Jayprakash12345: Restrict local uploads on dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380690 (https://phabricator.wikimedia.org/T176706) (owner: 10Jayprakash12345) [11:25:39] (03PS5) 10ArielGlenn: Move daaset nfs server manifests to dump module [puppet] - 10https://gerrit.wikimedia.org/r/380721 (https://phabricator.wikimedia.org/T175528) [11:28:53] !log kartik@tin Started deploy [cxserver/deploy@f1d4851]: Update cxserver, registry reconstruction [11:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:46] !log kartik@tin Finished deploy 
[cxserver/deploy@f1d4851]: Update cxserver, registry reconstruction (duration: 00m 52s) [11:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:54] (03PS6) 10ArielGlenn: Move daaset nfs server manifests to dump module [puppet] - 10https://gerrit.wikimedia.org/r/380721 (https://phabricator.wikimedia.org/T175528) [11:30:18] !log kartik@tin (no justification provided) [11:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:00] (03CR) 10Hashar: [C: 031] "\O/" [puppet] - 10https://gerrit.wikimedia.org/r/380712 (owner: 10Muehlenhoff) [11:34:58] (03PS7) 10ArielGlenn: Move daaset nfs server manifests to dump module [puppet] - 10https://gerrit.wikimedia.org/r/380721 (https://phabricator.wikimedia.org/T175528) [11:49:02] !log uploaded prometheus-hhvm-exporter 0.3.1+deb9u1 to apt.wikimedia.org/stretch-wikimedia [11:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:04] (03CR) 10Ema: bgp: bugfixes in MPUnreachNLRI attribute construction (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/379973 (owner: 10Ema) [11:55:32] (03CR) 10Ema: [C: 032] bgp: bugfixes in MPUnreachNLRI attribute construction [debs/pybal] - 10https://gerrit.wikimedia.org/r/379973 (owner: 10Ema) [11:55:45] (03PS1) 10Ema: bgp: bugfixes in MPUnreachNLRI attribute construction [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/380724 [11:55:53] (03CR) 10Faidon Liambotis: [C: 031] "This probably explains login issues/delays when we had the LDAP outage?" 
[puppet] - 10https://gerrit.wikimedia.org/r/380717 (https://phabricator.wikimedia.org/T176609) (owner: 10Volans) [11:56:59] !log Drop redundant indexes from enwiki on codfw - T174509 [11:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:05] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [11:57:26] marostegui <3 [11:57:45] haha \o/ [11:58:29] !log uploaded prometheus-apache-exporter 0.3-1+deb9u1 to apt.wikimedia.org/stretch-wikimedia [11:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:22] (03CR) 10Ema: [C: 032] bgp: bugfixes in MPUnreachNLRI attribute construction [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/380724 (owner: 10Ema) [12:00:12] !log kartik@tin Started deploy [cxserver/deploy@aa232ee]: Update cxserver, registry reconstruction [12:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:16] !log kartik@tin Finished deploy [cxserver/deploy@aa232ee]: Update cxserver, registry reconstruction (duration: 03m 03s) [12:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:20] 10Operations, 10DBA, 10Availability (Multiple-active-datacenters), 10Performance-Team (Radar): Make client certs available for apache/maintenance hosts for TLS connections to mariadb - https://phabricator.wikimedia.org/T175672#3599592 (10faidon) I may be missing something, but why do we need //client// cer... 
[12:06:10] (03PS1) 10Muehlenhoff: Adapt mediawiki::packages::math for stretch [puppet] - 10https://gerrit.wikimedia.org/r/380726 [12:06:32] (03CR) 10jerkins-bot: [V: 04-1] Adapt mediawiki::packages::math for stretch [puppet] - 10https://gerrit.wikimedia.org/r/380726 (owner: 10Muehlenhoff) [12:07:44] (03PS1) 10Ema: bgp: use set comprehensions instead of set(list comprehension) [debs/pybal] - 10https://gerrit.wikimedia.org/r/380727 [12:08:05] 10Operations, 10DBA, 10Availability (Multiple-active-datacenters), 10Performance-Team (Radar): Make client certs available for apache/maintenance hosts for TLS connections to mariadb - https://phabricator.wikimedia.org/T175672#3635120 (10jcrespo) we do not need client certs- we need the "public" CA being a... [12:11:04] PROBLEM - puppet last run on mw1244 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[apache2] [12:11:40] (03CR) 10Volans: [C: 031] "LGTM, thanks for the improvement!" 
[debs/pybal] - 10https://gerrit.wikimedia.org/r/380727 (owner: 10Ema) [12:12:01] (03CR) 10Ema: [C: 032] bgp: use set comprehensions instead of set(list comprehension) [debs/pybal] - 10https://gerrit.wikimedia.org/r/380727 (owner: 10Ema) [12:12:09] (03PS2) 10Muehlenhoff: Adapt mediawiki::packages::math for stretch [puppet] - 10https://gerrit.wikimedia.org/r/380726 [12:12:11] (03PS1) 10Ema: bgp: use set comprehensions instead of set(list comprehension) [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/380729 [12:12:48] !log Drop redundant indexes from enwiki on eqiad - T174509 [12:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:53] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [12:15:19] (03CR) 10Ema: [C: 032] bgp: use set comprehensions instead of set(list comprehension) [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/380729 (owner: 10Ema) [12:15:26] (03CR) 10Faidon Liambotis: [C: 04-2] "I had a closer look at the code. I don't see why it wouldn't work as it is right now: the code does an out-of-scope variable lookup and th" [puppet] - 10https://gerrit.wikimedia.org/r/377986 (https://phabricator.wikimedia.org/T175864) (owner: 10Hashar) [12:16:00] hashar: ^^ [12:20:44] 10Operations, 10DBA, 10Availability (Multiple-active-datacenters), 10Performance-Team (Radar): Make client certs available for apache/maintenance hosts for TLS connections to mariadb - https://phabricator.wikimedia.org/T175672#3635125 (10faidon) That isn't needed. We import the puppet CA to the host's cert... [12:21:47] paravoid: ha the geoip thing. I have absolutely no idea why the variable lookup fails (and fallback to an empty string) [12:21:53] does it? [12:22:01] I don't think it does [12:23:02] and P6006 works (but there's a typo in there, geo*p*ip_destdir, maybe that threw you off?) [12:23:03] I will dig a bit more. 
I am not entirely sure why exactly I needed that [12:23:19] what were the symptoms that made you think that the variable lookup fails? [12:24:38] puppet tried to install the files under /GeoIP/ (instead of /usr/share) [12:24:54] 10Operations, 10DBA, 10Availability (Multiple-active-datacenters), 10Performance-Team (Radar): Make client certs available for apache/maintenance hosts for TLS connections to mariadb - https://phabricator.wikimedia.org/T175672#3599592 (10Joe) No puppet patch is needed, if you just need the CA cert availabl... [12:25:17] that was on the CI puppet master ( integration-puppetmaster01.integration.eqiad.wmflabs ) [12:26:25] also the cron entry that updates the file missed the /usr/share prefix :( [12:26:45] that is probably still reproducible on the CI puppet master and one of the agents [12:27:09] !log Drop redundant indexes from s2 - T174509 [12:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:14] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [12:30:43] PROBLEM - puppet last run on mw1272 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[apache2],Package[apache2-utils] [12:31:04] (03PS2) 10Gehel: logstash: remove cleanup code [puppet] - 10https://gerrit.wikimedia.org/r/380707 [12:31:41] (03CR) 10Hashar: [C: 031] Disable RelatedArticles instrumentation on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380554 (https://phabricator.wikimedia.org/T174944) (owner: 10Jdlrobson) [12:32:02] hashar: I am trying to replicate what you say right now and I can't (I've already reverted that patch from the integration-puppetmaster01). 
I 've even removed files from /var/lib/puppet/volatile and re-run the cron entry (which looks correct) [12:32:02] (03CR) 10Gehel: [C: 032] logstash: remove cleanup code [puppet] - 10https://gerrit.wikimedia.org/r/380707 (owner: 10Gehel) [12:32:09] (03CR) 10Hashar: [C: 031] Enable asia-specific Navigation Timing metric [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380441 (https://phabricator.wikimedia.org/T169522) (owner: 10Gilles) [12:33:16] I am kind of worried because if what you say is true, that behavior is a number of places across our manifests [12:33:18] 10Operations, 10DBA, 10Availability (Multiple-active-datacenters), 10Performance-Team (Radar): Make client certs available for apache/maintenance hosts for TLS connections to mariadb - https://phabricator.wikimedia.org/T175672#3635140 (10Joe) Looking at http://php.net/manual/en/mysqli.ssl-set.php, I would... [12:33:21] is in* [12:33:28] which would bite us badly [12:33:36] (03CR) 10Hashar: [C: 04-1] "Since there is a Depends-On I0f81a013ec994eee3f156a89f29f4fcfc37c42b7 , I guess you need https://gerrit.wikimedia.org/r/#/c/376251/ to b" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376043 (https://phabricator.wikimedia.org/T144479) (owner: 10Gilles) [12:34:09] (03PS3) 10Muehlenhoff: Add support for stretch to hhvm::debug [puppet] - 10https://gerrit.wikimedia.org/r/380722 [12:34:22] gilles: looks like one of your mediawiki-config patch depends on a change in mediawiki/core that is not in the wmf branches yet :] [12:34:32] hashar: that's fine [12:35:17] ahh the new settings would just be ignored [12:35:18] it's probably the case for both [12:35:20] right [12:35:22] (should have looked at the code [12:35:38] (03CR) 10Hashar: [C: 031] "That is just new settings. It is back compatible :]" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376043 (https://phabricator.wikimedia.org/T144479) (owner: 10Gilles) [12:38:45] hashar: did you see akosiaris' comment above? 
[12:39:24] RECOVERY - puppet last run on mw1244 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [12:39:50] (03PS5) 10Elukey: Introduce profile::druid::worker [puppet] - 10https://gerrit.wikimedia.org/r/380449 (https://phabricator.wikimedia.org/T176223) [12:40:06] (03PS2) 10Gehel: wdqs: reduce heap size to 12GB [puppet] - 10https://gerrit.wikimedia.org/r/378227 (https://phabricator.wikimedia.org/T175919) [12:40:25] (03CR) 10Elukey: Introduce profile::druid::worker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/380449 (https://phabricator.wikimedia.org/T176223) (owner: 10Elukey) [12:40:44] (03CR) 10Gehel: [C: 032] wdqs: reduce heap size to 12GB [puppet] - 10https://gerrit.wikimedia.org/r/378227 (https://phabricator.wikimedia.org/T175919) (owner: 10Gehel) [12:44:39] !log restarting blazegraph on wdqs1004 / wdqs2001 for heap resive - T175919 [12:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:44] T175919: investigate GC times on wikidata query service - https://phabricator.wikimedia.org/T175919 [12:47:29] (03PS8) 10ArielGlenn: Move dataset nfs server manifests to dump module [puppet] - 10https://gerrit.wikimedia.org/r/380721 (https://phabricator.wikimedia.org/T175528) [12:47:37] (03CR) 10Elukey: "> I was trying to sync up the LVS patch with this. Since we're going" [puppet] - 10https://gerrit.wikimedia.org/r/380449 (https://phabricator.wikimedia.org/T176223) (owner: 10Elukey) [12:47:47] 10Operations, 10cloud-services-team (Kanban): puppet ca_server confusion - https://phabricator.wikimedia.org/T176437#3635161 (10akosiaris) It's also used in revocation checks which can and will happen by masters and not just agents, in order to verify the agent is authorized to connect to them and obtain the c... 
[12:47:55] (03CR) 10jerkins-bot: [V: 04-1] Move dataset nfs server manifests to dump module [puppet] - 10https://gerrit.wikimedia.org/r/380721 (https://phabricator.wikimedia.org/T175528) (owner: 10ArielGlenn) [12:54:40] paravoid: nop sorry. I will try to reproduce ( ping akosiaris ) [12:54:54] and yeah if that pattern is used elsewhere, that is indeed worrying [12:55:18] I don't think it is, that kind of syntax should work fine I think [12:56:06] hashar: I am round [12:56:09] around* [12:56:29] fyi, I've removed that patch from integration-puppetmaster01 and I cannot reproduce it [12:58:22] (03CR) 10Muehlenhoff: "Cherrypicked to deployment-prep, NOP on jessie, fixes package installation on stretch." [puppet] - 10https://gerrit.wikimedia.org/r/380722 (owner: 10Muehlenhoff) [12:58:46] (03CR) 10Muehlenhoff: "Cherrypicked to deployment-prep, NOP on jessie, fixes package installation on stretch." [puppet] - 10https://gerrit.wikimedia.org/r/380726 (owner: 10Muehlenhoff) [12:59:04] RECOVERY - puppet last run on mw1272 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [13:00:02] (03CR) 10Giuseppe Lavagetto: [C: 031] Adapt mediawiki::packages::math for stretch [puppet] - 10https://gerrit.wikimedia.org/r/380726 (owner: 10Muehlenhoff) [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170926T1300). [13:00:04] Jdlrobson and gilles: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:35] !log add mw132[7,8] to live traffic (new appservers) - weights will be increased incrementally from 5 to 20 [13:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:04] !log Drop redundant indexes from s6 - T174509 [13:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:10] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [13:04:02] akosiaris: paravoid: indeed it works perfectly now. I swear the variable somehow ended up undefined :( [13:04:48] (03Abandoned) 10Hashar: puppetmaster: pass volatile_dir to geoip class [puppet] - 10https://gerrit.wikimedia.org/r/377986 (https://phabricator.wikimedia.org/T175864) (owner: 10Hashar) [13:06:41] 10Operations, 10Continuous-Integration-Infrastructure, 10DNS, 10Traffic, and 2 others: CI: operations-dns-lint broken due to missing Maxmind DB file - https://phabricator.wikimedia.org/T175864#3635184 (10hashar) 05Open>03Resolved a:03hashar Apparently that was transient or puppet was not willing to c... [13:06:57] akosiaris: paravoid: thank you! I have abandoned the patch and resolved the task. If it happens again I guess I will revisit and investigate more [13:07:10] hashar: thanks! [13:07:41] akosiaris: and if you feel adventurous I got a patch to convert the tests from make/pp to spec files https://gerrit.wikimedia.org/r/#/c/377980/ [13:07:47] but that is tedious to revie [13:07:48] w [13:08:08] hashar: I can have a look [13:08:33] hashar: :) [13:08:56] jdlrobson: phuedx: around? There is a "Disable RelatedArticles instrumentation on all wikis" https://gerrit.wikimedia.org/r/#/c/380554/ for SWAT [13:09:20] (03PS9) 10Filippo Giunchedi: base: ability to send syslog over TLS [puppet] - 10https://gerrit.wikimedia.org/r/378922 (https://phabricator.wikimedia.org/T136312) [13:09:30] jdlrobson: phuedx: it is straightforward and I don't see what it can potentially break. 
So I guess I am just going to deploy it [13:09:57] hashar: in general, please reach out in the future if you don't understand a behavior -- it would had probably taken less time for us to help out than it did to review this :) [13:09:58] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380554 (https://phabricator.wikimedia.org/T174944) (owner: 10Jdlrobson) [13:10:20] I was scratching my head for a while because that all looked fine and couldn't understand the patch, then pinged alex JIC I wasn't seeing something etc. [13:10:25] paravoid: ack. Though I am 100% sure the variable was not set :^| [13:10:56] but even if so, you weren't sure /why/ (or when it stopped) and that's an important aspect :) [13:11:02] (03CR) 10Bmansurov: [C: 04-1] Disable RelatedArticles instrumentation on all wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380554 (https://phabricator.wikimedia.org/T174944) (owner: 10Jdlrobson) [13:11:27] hashar: i'm around [13:11:32] it was scheduled for this swat? [13:11:33] yeah I usually try to figure out the original root cause. But apparently it has been broken since at least 2015 (based on some file timestamps [13:11:36] (03Merged) 10jenkins-bot: Disable RelatedArticles instrumentation on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380554 (https://phabricator.wikimedia.org/T174944) (owner: 10Jdlrobson) [13:11:40] paravoid: I will do better next time :] [13:11:50] phuedx: yeah by Jon. 
I have +2ed it already [13:11:51] (03CR) 10jenkins-bot: Disable RelatedArticles instrumentation on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380554 (https://phabricator.wikimedia.org/T174944) (owner: 10Jdlrobson) [13:12:07] hashar: i see a -1 from bmansurov at 13:11:03 utc [13:12:11] phuedx: looks like RelatedArticles A/B testing is complete [13:12:29] phuedx, that has been addressed I think [13:12:53] hashar: it's fine, don't worry :) [13:13:20] phuedx: the original patch dropped the $wg variable, but the extension has some otehr default [13:13:22] hashar: just saying, ping us if you're running into something weird and then we can figure it out together :) [13:13:43] paravoid: are you sure? I feel I find a puppet oddity on a daily basis!!! *grin* [13:13:51] will do for sure [13:14:22] Thanks hashar [13:15:08] hashar: if you deploy it then we can prove it's working fine [13:15:11] gilles: will you deploy your patch yourself or do you want me to do them ? [13:15:18] Sorry running a bit late :) [13:15:28] phuedx: jdlrobson: it is being deployed accross the cluster [13:15:34] !log hashar@tin Synchronized wmf-config/CommonSettings.php: Disable RelatedArticles instrumentation on all wikis - T174944 (duration: 00m 45s) [13:15:36] we've got https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1&var-schema=RelatedArticles&from=now-3h&to=now [13:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:39] T174944: Disable RelatedArticles instrumentation on all wikis - https://phabricator.wikimedia.org/T174944 [13:15:41] hashar: ta [13:15:44] hashar: please do them if you have time [13:15:52] I'm in the middle of something else [13:16:04] gilles: doing them :) [13:16:29] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380441 (https://phabricator.wikimedia.org/T169522) (owner: 10Gilles) [13:16:32] hashar waits for no person [13:17:57] (03PS10) 10Filippo Giunchedi: 
base: ability to send syslog over TLS [puppet] - 10https://gerrit.wikimedia.org/r/378922 (https://phabricator.wikimedia.org/T136312) [13:21:05] (03PS4) 10Hashar: Enable asia-specific Navigation Timing metric [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380441 (https://phabricator.wikimedia.org/T169522) (owner: 10Gilles) [13:21:18] (03CR) 10Hashar: [C: 032] Enable asia-specific Navigation Timing metric [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380441 (https://phabricator.wikimedia.org/T169522) (owner: 10Gilles) [13:21:22] (03PS1) 10Giuseppe Lavagetto: profile::docker::storage::loopback: change logic for mountpoint [puppet] - 10https://gerrit.wikimedia.org/r/380740 [13:21:33] RECOVERY - mediawiki-installation DSH group on mw1327 is OK: OK [13:22:18] !log upgrade grafana to 4.5.2 on labmon1001 - T175980 [13:22:23] (03CR) 10Hashar: [C: 032] "The thumborUser has been set in the private repo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376043 (https://phabricator.wikimedia.org/T144479) (owner: 10Gilles) [13:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:24] T175980: Upgrade grafana to 4.5 - https://phabricator.wikimedia.org/T175980 [13:22:50] (03Merged) 10jenkins-bot: Enable asia-specific Navigation Timing metric [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380441 (https://phabricator.wikimedia.org/T169522) (owner: 10Gilles) [13:23:00] (03CR) 10jenkins-bot: Enable asia-specific Navigation Timing metric [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380441 (https://phabricator.wikimedia.org/T169522) (owner: 10Gilles) [13:23:18] 10Operations, 10monitoring, 10User-fgiunchedi: Upgrade grafana to 4.5 - https://phabricator.wikimedia.org/T175980#3635225 (10fgiunchedi) [13:23:53] (03Merged) 10jenkins-bot: Expose Thumbor swift username [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376043 (https://phabricator.wikimedia.org/T144479) (owner: 10Gilles) [13:24:14] !log hashar@tin 
Synchronized wmf-config/CommonSettings.php: Enable asia-specific Navigation Timing metric - T169522 (duration: 00m 47s) [13:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:20] T169522: Measure separate NavigationTiming metric(s) focused on Asia with higher sampling - https://phabricator.wikimedia.org/T169522 [13:24:52] (03CR) 10Alexandros Kosiaris: [C: 032] puppetmaster: convert tests to spec [puppet] - 10https://gerrit.wikimedia.org/r/377980 (owner: 10Hashar) [13:24:57] (03PS2) 10Alexandros Kosiaris: puppetmaster: convert tests to spec [puppet] - 10https://gerrit.wikimedia.org/r/377980 (owner: 10Hashar) [13:25:06] (03CR) 10Alexandros Kosiaris: [C: 032] profile::docker::storage::loopback: change logic for mountpoint [puppet] - 10https://gerrit.wikimedia.org/r/380740 (owner: 10Giuseppe Lavagetto) [13:25:28] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/8033/" [puppet] - 10https://gerrit.wikimedia.org/r/378922 (https://phabricator.wikimedia.org/T136312) (owner: 10Filippo Giunchedi) [13:25:30] (03CR) 10jenkins-bot: Expose Thumbor swift username [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376043 (https://phabricator.wikimedia.org/T144479) (owner: 10Gilles) [13:25:33] !log hashar@tin Synchronized wmf-config/filebackend.php: Expose Thumbor swift username - T144479 (duration: 00m 44s) [13:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:38] T144479: Ensure thumbor container access is preserved by mw filebackend setzoneaccess - https://phabricator.wikimedia.org/T144479 [13:25:42] <_joe_> akosiaris: wait [13:25:47] <_joe_> I changed something more [13:25:51] jdlrobson: it took ~2 days for the level of popups events to get to > 3 consistently and we're still seeing 1 or 2 events every now and again [13:25:53] <_joe_> did you merge theat? 
[13:25:54] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] puppetmaster: convert tests to spec [puppet] - 10https://gerrit.wikimedia.org/r/377980 (owner: 10Hashar) [13:25:59] (03CR) 10Andrew Bogott: sshd: skip LDAP lookup for the root user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/380717 (https://phabricator.wikimedia.org/T176609) (owner: 10Volans) [13:26:02] <_joe_> do NOT merge mine [13:26:17] jdlrobson: i expect relatedarticles will behave the same: the majority of clients get the change immediately [13:26:19] 10Operations, 10MediaWiki-Maintenance-scripts, 10Performance-Team, 10Thumbor, and 2 others: Ensure thumbor container access is preserved by mw filebackend setzoneaccess - https://phabricator.wikimedia.org/T144479#3635238 (10Gilles) 05Open>03Resolved The change will only be effective once the train has... [13:26:30] phuedx: is that due to some javascript being cached on the client side? [13:26:36] (03PS2) 10Giuseppe Lavagetto: profile::docker::storage::loopback: change logic for mountpoint [puppet] - 10https://gerrit.wikimedia.org/r/380740 [13:26:48] hashar: that and people don't close their browser windows ever [13:26:50] MONSTERS [13:26:54] gilles: I have deployed both your patches. [13:27:02] hashar: thanks! [13:27:11] we should create a script that crashes your browser if you leave it open too long ;-) [13:27:13] (03CR) 10Filippo Giunchedi: "@Alex, I've reworked the approach to be incremental because I realized the previous approach wasn't very flexible (hardcording syslog tcp " [puppet] - 10https://gerrit.wikimedia.org/r/378922 (https://phabricator.wikimedia.org/T136312) (owner: 10Filippo Giunchedi) [13:27:13] phuedx: oh I know those. James_F surely never closes his browser :] [13:27:20] pretty graphs: https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1&var-schema=RelatedArticles&from=now-30m&to=now [13:27:46] jdlrobson: making it reload the page would be enough! 
we could really do that if a tab gets focus after having been out of focus for days [13:27:49] Hey. Every month or two I do. :-) [13:27:50] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::docker::storage::loopback: change logic for mountpoint [puppet] - 10https://gerrit.wikimedia.org/r/380740 (owner: 10Giuseppe Lavagetto) [13:28:05] (03CR) 10Volans: "reply inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/380717 (https://phabricator.wikimedia.org/T176609) (owner: 10Volans) [13:29:20] gilles: but they need to learn (!!!) [13:29:23] PROBLEM - puppet last run on copper is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 35 seconds ago with 1 failures. Failed resources (up to 3 shown): Service[docker] [13:29:59] jdlrobson: an in-page popup that states they should probably reload the page, then [13:30:05] <_joe_> ? [13:30:17] <_joe_> that alert is bogus [13:30:23] RECOVERY - puppet last run on copper is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [13:33:57] (03PS2) 10Andrew Bogott: sshd: skip LDAP lookup for the root user [puppet] - 10https://gerrit.wikimedia.org/r/380717 (https://phabricator.wikimedia.org/T176609) (owner: 10Volans) [13:34:55] (03CR) 10Andrew Bogott: [C: 032] sshd: skip LDAP lookup for the root user [puppet] - 10https://gerrit.wikimedia.org/r/380717 (https://phabricator.wikimedia.org/T176609) (owner: 10Volans) [13:34:57] i just don't understand these people [13:35:55] (03PS2) 10Filippo Giunchedi: prometheus: add analytics instance [puppet] - 10https://gerrit.wikimedia.org/r/378716 (https://phabricator.wikimedia.org/T175922) [13:36:51] andrewbogott: are you also puppet-merging it? [13:36:58] yes [13:37:08] thanks! 
[13:39:16] !log Deploy alter table on s6 master (with replication enable, so lag will be generated) db2028 - T163979 [13:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:21] T163979: Convert unique keys into primary keys for some wiki tables on s6-eqiad and codfw - https://phabricator.wikimedia.org/T163979 [13:41:28] (03PS9) 10ArielGlenn: Move dataset nfs server manifests to dump module [puppet] - 10https://gerrit.wikimedia.org/r/380721 (https://phabricator.wikimedia.org/T175528) [13:45:12] (03PS1) 10Elukey: role::prometheus::ops: add kafka metrics [puppet] - 10https://gerrit.wikimedia.org/r/380744 (https://phabricator.wikimedia.org/T175922) [13:49:40] (03CR) 10Hoo man: "Currently we fetch the 100 next entities at a time and then discard about 4/5 of these (as they don't belong to the current shard). Afterw" [puppet] - 10https://gerrit.wikimedia.org/r/380628 (owner: 10Hoo man) [13:51:46] (03CR) 10Elukey: [C: 031] "Looking forward to add metrics to it :) Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/378716 (https://phabricator.wikimedia.org/T175922) (owner: 10Filippo Giunchedi) [13:51:49] RECOVERY - mediawiki-installation DSH group on mw1328 is OK: OK [13:58:01] (03PS2) 10Elukey: role::prometheus::ops: add kafka metrics [puppet] - 10https://gerrit.wikimedia.org/r/380744 (https://phabricator.wikimedia.org/T175922) [14:00:53] 10Operations, 10Contributors-Team, 10MobileFrontend, 10MW-1.31-release-notes (WMF-deploy-2017-09-26 (1.31.0-wmf.1)), and 2 others: Diff page produces 503 on first visit - https://phabricator.wikimedia.org/T176637#3632060 (10phuedx) @Jdlrobson: Was/is this specific to the Beta Cluster? I can't reproduce thi... [14:01:28] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to production bastions for cwdent - https://phabricator.wikimedia.org/T176529#3635443 (10herron) @Dzahn I'd be happy to merge this tomorrow (weds) pending no objections. Feel free to assign to me if you'd like. 
[14:01:46] (03PS1) 10Gilles: Upgrade to 1.6 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/380747 (https://phabricator.wikimedia.org/T172556) [14:04:29] (03CR) 10Alexandros Kosiaris: [C: 04-1] "overall looks good, some ruby comments inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/374527 (owner: 10Hashar) [14:05:38] (03PS1) 10Marostegui: db-codfw.php: Depool db2072 and db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380749 (https://phabricator.wikimedia.org/T174509) [14:07:18] 10Operations, 10Ops-Access-Requests: Requesting access to pingback data for cicalese - https://phabricator.wikimedia.org/T176749#3635454 (10cicalese) [14:08:03] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2072 and db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380749 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [14:09:42] 10Operations, 10Contributors-Team, 10MobileFrontend, 10Readers-Web-Backlog (Tracking), 10Release-Engineering-Team (Watching / External): Diff page produces 503 on beta cluster on first visit - https://phabricator.wikimedia.org/T176637#3635473 (10Jdlrobson) [14:10:22] 10Operations, 10Contributors-Team, 10MobileFrontend, 10Readers-Web-Backlog (Tracking), 10Release-Engineering-Team (Watching / External): Diff page consistently produces 503 on beta cluster on first visit - https://phabricator.wikimedia.org/T176637#3632060 (10Jdlrobson) [14:10:29] PROBLEM - puppet last run on mw2131 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[apache2] [14:10:44] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2072 and db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380749 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [14:10:48] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to production bastions for cwdent - https://phabricator.wikimedia.org/T176529#3628771 (10MoritzMuehlenhoff) @herron: We really don't need a waiting period here. Casey had that access before and waived it at https://phabricator.wikimed... [14:10:53] (03CR) 10jenkins-bot: db-codfw.php: Depool db2072 and db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380749 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [14:11:17] (03CR) 10Hashar: apt: spec boiler plate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/374527 (owner: 10Hashar) [14:11:38] (03PS3) 10Hashar: apt: spec boiler plate [puppet] - 10https://gerrit.wikimedia.org/r/374527 [14:11:40] (03PS3) 10Giuseppe Lavagetto: Add support for build containers [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/379792 [14:11:42] (03PS2) 10Giuseppe Lavagetto: Add proxy support to the build process itself [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/380696 [14:11:44] (03PS3) 10Giuseppe Lavagetto: Add ruby base image and a fluentd image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/379793 [14:11:46] (03PS2) 10Giuseppe Lavagetto: Add image for running fluentd as a daemonset in kubernetes [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/380697 [14:12:03] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2071 and db2072 to optimize templatelinks and pagelinks tables - T174509 (duration: 00m 44s) [14:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:08] T174509: Drop now redundant indexes from pagelinks and templatelinks - 
https://phabricator.wikimedia.org/T174509 [14:12:51] godog: what can you tell me about the 'graphite' labs project? I'm working on fixing puppet on a bunch of those VMs but don't want to fix things if they're unused [14:13:48] 10Operations, 10DBA, 10Availability (Multiple-active-datacenters), 10Performance-Team (Radar): Make client certs available for apache/maintenance hosts for TLS connections to mariadb - https://phabricator.wikimedia.org/T175672#3635511 (10jcrespo) @faidon, cool! Less work for us :-) As you imagine, I didn't... [14:14:28] 10Operations, 10DBA, 10Availability (Multiple-active-datacenters), 10Performance-Team (Radar): Make client certs available for apache/maintenance hosts for TLS connections to mariadb - https://phabricator.wikimedia.org/T175672#3635515 (10jcrespo) @Joe so s/puppet patch/mediawiki config patch/ :-) [14:14:37] (03CR) 10Alexandros Kosiaris: [C: 032] apt: spec boiler plate [puppet] - 10https://gerrit.wikimedia.org/r/374527 (owner: 10Hashar) [14:14:43] (03PS4) 10Alexandros Kosiaris: apt: spec boiler plate [puppet] - 10https://gerrit.wikimedia.org/r/374527 (owner: 10Hashar) [14:14:48] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] apt: spec boiler plate [puppet] - 10https://gerrit.wikimedia.org/r/374527 (owner: 10Hashar) [14:14:50] andrewbogott: on which instances? [14:15:08] (03CR) 10Ottomata: [C: 031] "Talked to Luca, we are going to address my fears in the next patch :) This should be a no-op as is." [puppet] - 10https://gerrit.wikimedia.org/r/380449 (https://phabricator.wikimedia.org/T176223) (owner: 10Elukey) [14:15:15] 10Operations, 10Ops-Access-Requests: Requesting access to pingback data for cicalese - https://phabricator.wikimedia.org/T176749#3635454 (10CCicalese_WMF) I inadvertently created this ticket from my non-WMF Phab account. [14:15:34] godog: we're trying to get cumin to work, which requires a clean puppet run. 
If you look at https://etherpad.wikimedia.org/p/wmcs-cumin-fail you'll see that most graphite instances can't be reached [14:15:34] PROBLEM - mysqld processes on labsdb1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [14:15:50] <_joe_> uh? [14:15:53] no idea [14:16:09] (03PS6) 10Alexandros Kosiaris: apt:pin pref file must not have space [puppet] - 10https://gerrit.wikimedia.org/r/353540 (owner: 10Hashar) [14:16:12] (03CR) 10Alexandros Kosiaris: [C: 032] apt:pin pref file must not have space [puppet] - 10https://gerrit.wikimedia.org/r/353540 (owner: 10Hashar) [14:16:14] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] apt:pin pref file must not have space [puppet] - 10https://gerrit.wikimedia.org/r/353540 (owner: 10Hashar) [14:16:27] godog: I'm working on graphite1 now [14:16:35] (03PS6) 10Elukey: Introduce profile::druid::worker [puppet] - 10https://gerrit.wikimedia.org/r/380449 (https://phabricator.wikimedia.org/T176223) [14:17:12] mysql was killed by oom [14:17:32] marostegui: ack, thoughts on course of action? [14:17:43] bring it back? :) [14:17:47] heh [14:17:49] hm, lots of these systems that haven't run puppet for a while are giving me '$facts["numa"] is not a hash or array' [14:18:27] marostegui: I'll let you? [14:18:37] chasemp: memory has been growing a lot since 14:00 or so: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=labsdb1005&var-network=bond0 [14:18:40] marostegui: can I PM? [14:18:41] chasemp: yep, i will do it [14:18:57] tabbycat: yep, busy now with labsdb1005, is it related? 
[14:18:57] marostegui: oh ugly and thanks [14:19:16] marostegui: not sure, it's about outdated replicas [14:19:26] tabbycat: pm and I will reply later :) [14:19:28] on the new servers [14:19:31] okay [14:19:33] andrewbogott: ack, afaik they can all be nuked [14:19:43] RECOVERY - mysqld processes on labsdb1005 is OK: PROCS OK: 1 process with command name mysqld [14:19:45] godog: the whole project, or just the things on that list? [14:19:45] godog: nice :) [14:19:54] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/8038/" [puppet] - 10https://gerrit.wikimedia.org/r/380449 (https://phabricator.wikimedia.org/T176223) (owner: 10Elukey) [14:19:58] (which is maybe everything, I haven't checked) [14:20:33] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Add support for build containers [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/379792 (owner: 10Giuseppe Lavagetto) [14:20:50] andrewbogott: I don't know about the whole project tbh, maybe nuke what's in the pad and leave the project [14:20:53] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Add proxy support to the build process itself [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/380696 (owner: 10Giuseppe Lavagetto) [14:21:08] godog: ok! (btw, the pad has every vm in the project it turns out) [14:21:17] chasemp: hehe autumn cleanups [14:21:30] lol [14:21:41] chasemp: should be fine now [14:22:00] thanks godog! [14:23:49] np! [14:23:57] 10Operations, 10Mail, 10Surveys: Qualtrics email-LDAP issue - https://phabricator.wikimedia.org/T176666#3633265 (10herron) FWIW I do see messages from the qualtrics mail system being relayed through the wikimedia.org MX and onward to google. Are there any examples of errors/bounces relating to this issue av... 
[14:24:18] (03PS5) 10Rush: wmcs: Add wikireplica_dns management script [puppet] - 10https://gerrit.wikimedia.org/r/378739 (https://phabricator.wikimedia.org/T174860) (owner: 10BryanDavis) [14:26:30] !log mwscript refreshLinks.php --wiki elwiki --namespace 0 on terbium has finished (T151717) [14:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:36] T151717: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717 [14:27:15] !log mobrovac@tin Started deploy [restbase/deploy@f989dd9]: Remove the old mobileapps module completely - T169940 [14:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:25] T169940: End of September milestone: Start migration of production use cases. - https://phabricator.wikimedia.org/T169940 [14:27:31] (03Abandoned) 10Paladox: phab/aphlict: move group up a bit in code [puppet] - 10https://gerrit.wikimedia.org/r/379436 (owner: 10Paladox) [14:27:43] (03PS6) 10Paladox: Phabricator: Remove ubuntu / upstart support [puppet] - 10https://gerrit.wikimedia.org/r/379794 [14:28:02] (03PS17) 10Paladox: Gerrit: Use systemd::service for systemd [puppet] - 10https://gerrit.wikimedia.org/r/378768 [14:28:15] (03PS5) 10Paladox: Gerrit: Fix systemd script [puppet] - 10https://gerrit.wikimedia.org/r/379136 [14:28:35] (03PS4) 10Paladox: Gerrit: Enable ui for slaves [puppet] - 10https://gerrit.wikimedia.org/r/379420 [14:31:25] (03PS6) 10Rush: wmcs: Add wikireplica_dns management script [puppet] - 10https://gerrit.wikimedia.org/r/378739 (https://phabricator.wikimedia.org/T174860) (owner: 10BryanDavis) [14:32:39] (03CR) 10Rush: [C: 032] wmcs: Add wikireplica_dns management script [puppet] - 10https://gerrit.wikimedia.org/r/378739 (https://phabricator.wikimedia.org/T174860) (owner: 10BryanDavis) [14:36:55] !log Optimize tables pagelinks and templatelinks on db2071 and db2072 from s1 - T174509 [14:37:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:00] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [14:37:45] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor comments, rest looks ok" (036 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/379793 (owner: 10Giuseppe Lavagetto) [14:38:25] 10Operations, 10DBA, 10Availability (Multiple-active-datacenters), 10Performance-Team (Radar): Make client certs available for apache/maintenance hosts for TLS connections to mariadb - https://phabricator.wikimedia.org/T175672#3635607 (10jcrespo) In fact, as this is not going to be enabled on all hosts yet... [14:38:59] RECOVERY - puppet last run on mw2131 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [14:39:32] (03PS4) 10Rush: tools: Disable IPv6 lookup for static reverse proxy backends [puppet] - 10https://gerrit.wikimedia.org/r/380318 (owner: 10BryanDavis) [14:40:35] (03CR) 10Rush: [C: 032] tools: Disable IPv6 lookup for static reverse proxy backends [puppet] - 10https://gerrit.wikimedia.org/r/380318 (owner: 10BryanDavis) [14:47:50] !log mobrovac@tin Finished deploy [restbase/deploy@f989dd9]: Remove the old mobileapps module completely - T169940 (duration: 20m 35s) [14:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:55] T169940: End of September milestone: Start migration of production use cases. 
- https://phabricator.wikimedia.org/T169940 [14:54:08] (03PS4) 10Giuseppe Lavagetto: Add ruby base image and a fluentd image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/379793 [14:54:10] (03PS3) 10Giuseppe Lavagetto: Add image for running fluentd as a daemonset in kubernetes [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/380697 [14:56:13] (03PS1) 10Volans: OpenStack backend: set default query params in config [software/cumin] - 10https://gerrit.wikimedia.org/r/380760 (https://phabricator.wikimedia.org/T176314) [14:56:20] PROBLEM - MariaDB Slave Lag: s6 on db2076 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 625.53 seconds [14:56:30] i thought i downtimed that one [14:56:42] i will do it now [14:56:49] thanks it was maintenance [14:57:01] too many surprises today :-) [14:57:05] yep, i downtimed the whole s6 in codfw [14:57:10] but maybe missed that host [14:57:22] (03PS1) 10Andrew Bogott: validatelabsfqdn.py: Make check case-insensitive [puppet] - 10https://gerrit.wikimedia.org/r/380761 [14:57:22] don't tell me we are missing downs again! 
[14:57:26] not maybe, i clearly did :) [14:59:14] (03CR) 10Alexandros Kosiaris: [C: 032] Add /data/ Redirect for commons [puppet] - 10https://gerrit.wikimedia.org/r/360887 (https://phabricator.wikimedia.org/T163922) (owner: 10Ladsgroup) [14:59:22] (03PS5) 10Alexandros Kosiaris: Add /data/ Redirect for commons [puppet] - 10https://gerrit.wikimedia.org/r/360887 (https://phabricator.wikimedia.org/T163922) (owner: 10Ladsgroup) [14:59:24] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add /data/ Redirect for commons [puppet] - 10https://gerrit.wikimedia.org/r/360887 (https://phabricator.wikimedia.org/T163922) (owner: 10Ladsgroup) [14:59:53] (03CR) 10Gehel: [C: 031] "LGTM" [software/cumin] - 10https://gerrit.wikimedia.org/r/380760 (https://phabricator.wikimedia.org/T176314) (owner: 10Volans) [15:01:26] (03CR) 10Giuseppe Lavagetto: [C: 031] "Nitpick, but lgtm" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/378922 (https://phabricator.wikimedia.org/T136312) (owner: 10Filippo Giunchedi) [15:02:29] RECOVERY - MariaDB Slave Lag: s6 on db2076 is OK: OK slave_sql_lag Replication lag: 6.16 seconds [15:05:27] (03PS1) 10Elukey: role::kafka::jumbo::broker: rename cluster hiera variable [puppet] - 10https://gerrit.wikimedia.org/r/380763 (https://phabricator.wikimedia.org/T175922) [15:06:23] !log restarting blazegraph on all wdqs nodes for heap resize - T175919 [15:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:28] T175919: investigate GC times on wikidata query service - https://phabricator.wikimedia.org/T175919 [15:08:07] (03PS2) 10Elukey: role::kafka::jumbo::broker: rename cluster hiera variable [puppet] - 10https://gerrit.wikimedia.org/r/380763 (https://phabricator.wikimedia.org/T175922) [15:08:23] (03PS10) 10Paladox: Gerrit: Remove ldap user and password from secure.config [puppet] - 10https://gerrit.wikimedia.org/r/366910 [15:11:40] (03PS1) 10Ayounsi: Reserve frack-management1-c-eqiad - 10.64.40.192/26 [dns] - 
10https://gerrit.wikimedia.org/r/380767 [15:14:41] (03PS1) 10Gehel: wdqs: reduce blazegraph heap size to 12GB [puppet] - 10https://gerrit.wikimedia.org/r/380768 (https://phabricator.wikimedia.org/T175919) [15:16:41] (03CR) 10Gehel: [C: 032] wdqs: reduce blazegraph heap size to 12GB [puppet] - 10https://gerrit.wikimedia.org/r/380768 (https://phabricator.wikimedia.org/T175919) (owner: 10Gehel) [15:18:42] !log starting eqiad frack switch to new infra - T174218 [15:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:49] T174218: Move eqiad frack to new infra - https://phabricator.wikimedia.org/T174218 [15:21:30] (03CR) 10Elukey: [C: 032] role::kafka::jumbo::broker: rename cluster hiera variable [puppet] - 10https://gerrit.wikimedia.org/r/380763 (https://phabricator.wikimedia.org/T175922) (owner: 10Elukey) [15:21:36] (03PS3) 10Elukey: role::kafka::jumbo::broker: rename cluster hiera variable [puppet] - 10https://gerrit.wikimedia.org/r/380763 (https://phabricator.wikimedia.org/T175922) [15:22:10] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 [15:23:40] PROBLEM - Router interfaces on pfw3-codfw is CRITICAL: CRITICAL: host 208.80.153.197, interfaces up: 67, down: 1, dormant: 0, excluded: 1, unused: 0 [15:24:49] RECOVERY - Router interfaces on pfw3-codfw is OK: OK: host 208.80.153.197, interfaces up: 68, down: 0, dormant: 0, excluded: 1, unused: 0 [15:26:04] (03PS4) 10Elukey: role::kafka::jumbo::broker: rename cluster hiera variable [puppet] - 10https://gerrit.wikimedia.org/r/380763 (https://phabricator.wikimedia.org/T175922) [15:26:19] (03CR) 10MarcoAurelio: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380689 (https://phabricator.wikimedia.org/T176709) (owner: 10Jayprakash12345) [15:27:21] (03CR) 10Filippo Giunchedi: [C: 031] role::kafka::jumbo::broker: rename cluster hiera variable [puppet] - 
10https://gerrit.wikimedia.org/r/380763 (https://phabricator.wikimedia.org/T175922) (owner: 10Elukey) [15:28:32] (03CR) 10MarcoAurelio: [C: 031] "Looks okay now. Thank you." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380689 (https://phabricator.wikimedia.org/T176709) (owner: 10Jayprakash12345) [15:28:38] (03CR) 10Elukey: [C: 032] role::kafka::jumbo::broker: rename cluster hiera variable [puppet] - 10https://gerrit.wikimedia.org/r/380763 (https://phabricator.wikimedia.org/T175922) (owner: 10Elukey) [15:29:06] 10Operations, 10DBA, 10Availability (Multiple-active-datacenters), 10Performance-Team (Radar): Make client certs available for apache/maintenance hosts for TLS connections to mariadb - https://phabricator.wikimedia.org/T175672#3635865 (10aaron) >>! In T175672#3635140, @Joe wrote: > Looking at http://php.ne... [15:30:24] (03PS2) 10Elukey: phragile: disallow .htaccess usage [puppet] - 10https://gerrit.wikimedia.org/r/379499 [15:30:56] (03CR) 10Elukey: [C: 032] phragile: disallow .htaccess usage [puppet] - 10https://gerrit.wikimedia.org/r/379499 (owner: 10Elukey) [15:32:15] (03Abandoned) 10Elukey: librenms::apache: disallow .htaccess usage [puppet] - 10https://gerrit.wikimedia.org/r/378858 (owner: 10Elukey) [15:33:13] (03CR) 10BryanDavis: "As feared by Merlijn, this does not apparently fix the problem. Still seeing errors like:" [puppet] - 10https://gerrit.wikimedia.org/r/380318 (owner: 10BryanDavis) [15:33:55] (03PS3) 10Elukey: role::prometheus::ops: add kafka metrics [puppet] - 10https://gerrit.wikimedia.org/r/380744 (https://phabricator.wikimedia.org/T175922) [15:34:10] (03CR) 10Ottomata: "Cool! If we don't need the monitoring.yaml thing (since we aren't using Ganglia), maybe we should remove it?" 
[puppet] - https://gerrit.wikimedia.org/r/380763 (https://phabricator.wikimedia.org/T175922) (owner: Elukey)
[15:39:29] PROBLEM - Host cp2012 is DOWN: PING CRITICAL - Packet loss = 100%
[15:41:29] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0
[15:43:01] ema, bblack ^^ cp2012, is it you?
[15:43:25] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp2012_v4, cp2012_v6
[15:43:45] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp2012_v4, cp2012_v6
[15:43:45] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp2012_v4, cp2012_v6
[15:46:44] ema, bblack: the only thing I can see in console is: Enumerat
[15:47:12] volans: not me, can you powercycle the host as you're logged in already?
[15:47:18] sure
[15:47:22] thanks
[15:47:59] ema: done, do you want to downtime it?
[15:49:41] ema: booting
[15:49:43] is it depooled?
[15:49:54] at login
[15:50:05] RECOVERY - Host cp2012 is UP: PING OK - Packet loss = 0%, RTA = 36.03 ms
[15:50:26] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 28 ESP OK
[15:50:35] (PS1) Ayounsi: Update eqiad frack monitoring [puppet] - https://gerrit.wikimedia.org/r/380772
[15:50:55] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 28 ESP OK
[15:50:55] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 28 ESP OK
[15:50:56] !log powercycled cp2012, no ping or ssh, console stuck
[15:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:12] interesting, I don't see any error in the kernel logs
[15:51:57] (CR) Ayounsi: [C: 2] Update eqiad frack monitoring [puppet] - https://gerrit.wikimedia.org/r/380772 (owner: Ayounsi)
[15:52:06] (PS2) Ayounsi: Update eqiad frack monitoring [puppet] - https://gerrit.wikimedia.org/r/380772
[15:52:19] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to production bastions for cwdent - https://phabricator.wikimedia.org/T176529#3635957 (Dzahn) a:Dzahn>herron Sure, thanks :)
[15:56:41] (PS4) Elukey: role::prometheus::ops: add kafka metrics [puppet] - https://gerrit.wikimedia.org/r/380744 (https://phabricator.wikimedia.org/T175922)
[15:57:58] (PS1) Volans: Configuration: allow an empty aliases.yaml file [software/cumin] - https://gerrit.wikimedia.org/r/380773
[15:58:07] Operations, MediaWiki-Maintenance-scripts, Performance-Team, Thumbor, and 2 others: Ensure thumbor container access is preserved by mw filebackend setzoneaccess - https://phabricator.wikimedia.org/T144479#2601048 (Krenair) >>! In T144479#3579083, @Gilles wrote: > This requires exposing the Thumbo...
[15:59:40] (03CR) 10Ayounsi: [C: 032] Reserve frack-management1-c-eqiad - 10.64.40.192/26 [dns] - 10https://gerrit.wikimedia.org/r/380767 (owner: 10Ayounsi) [16:00:05] godog, moritzm, and _joe_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170926T1600). [16:00:05] matthiasmullie, tabbycat, and Krinkle: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:10] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline for naming" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/380744 (https://phabricator.wikimedia.org/T175922) (owner: 10Elukey) [16:00:17] here! [16:01:14] matthiasmullie: looking at your patch [16:01:46] (03PS4) 10Filippo Giunchedi: Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/378233 (https://phabricator.wikimedia.org/T160185) (owner: 10Matthias Mullie) [16:02:36] (03PS1) 10Lucas Werkmeister (WMDE): Fix /data/ redirect for commons [puppet] - 10https://gerrit.wikimedia.org/r/380774 (https://phabricator.wikimedia.org/T163922) [16:02:40] in the meantime if other puppet swat volunteers would like to pick other swat patches I'd appreciate it! [16:03:16] (03CR) 10Filippo Giunchedi: [C: 032] Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/378233 (https://phabricator.wikimedia.org/T160185) (owner: 10Matthias Mullie) [16:05:11] checking other patches [16:05:38] second one already merged [16:07:16] (03PS1) 10Papaul: DNS: Add mgmt DNS entries for furud Bug:T176506 [dns] - 10https://gerrit.wikimedia.org/r/380776 [16:08:05] the webperf patches to navtiming.py looks good, I followed the issue and I think we can merge [16:08:26] Krinkle: ^ [16:08:54] Okay. I'm running late 15min. 
[16:08:58] Feel free to skip me for now. [16:08:58] 10Operations, 10Operations-Software-Development, 10Patch-For-Review, 10Services (watching): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560#3636009 (10Pchelolo) We've had an instance of this vandalism-related alert again and the `service-checker-swagger` wasn'... [16:08:59] Sorry [16:09:48] (03CR) 10Dzahn: [C: 032] Make it easy to set PHP ini flags with mwscript [puppet] - 10https://gerrit.wikimedia.org/r/378007 (owner: 10Aaron Schulz) [16:09:56] matthiasmullie: waiting for puppet to finish running [16:09:57] here you go, one. and another was done already [16:10:13] (03PS3) 10Dzahn: Make it easy to set PHP ini flags with mwscript [puppet] - 10https://gerrit.wikimedia.org/r/378007 (owner: 10Aaron Schulz) [16:10:48] neat, thanks mutante [16:13:30] alright [16:13:52] jouncebot: next [16:13:52] In 0 hour(s) and 46 minute(s): Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170926T1700) [16:14:22] I am a bit hesitant about hhvm: Set LANG=C.UTF-8 [16:14:34] has the change been discussed with ops? [16:16:24] 10Operations, 10MediaWiki-Maintenance-scripts, 10Performance-Team, 10Thumbor, and 2 others: Ensure thumbor container access is preserved by mw filebackend setzoneaccess - https://phabricator.wikimedia.org/T144479#3636030 (10Gilles) >>! In T144479#3635968, @Krenair wrote: >>>! In T144479#3579083, @Gilles wr... [16:16:28] 10Operations, 10ops-codfw: rack/setup/install furud.codfw.wmnet - https://phabricator.wikimedia.org/T176506#3636031 (10RobH) IRC update: Please name/label the md1400s as furud-array1 and furud-array2. Thanks! 
[16:18:36] (03PS1) 10Mforns: Remove reportupdater job that triggers abandoned discovery-stats [puppet] - 10https://gerrit.wikimedia.org/r/380778 (https://phabricator.wikimedia.org/T176639) [16:21:11] (03PS5) 10Elukey: role::prometheus::ops: add kafka metrics [puppet] - 10https://gerrit.wikimedia.org/r/380744 (https://phabricator.wikimedia.org/T175922) [16:23:29] 10Operations, 10Traffic: Server hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T156033#3636060 (10BBlack) [16:23:32] 10Operations, 10Traffic: Network hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T162683#3636059 (10BBlack) [16:23:35] 10Operations, 10Traffic: Name Asia Cache DC site - https://phabricator.wikimedia.org/T156028#3636061 (10BBlack) [16:23:38] 10Operations, 10Traffic: Select site vendor for Asia Cache Datacenter - https://phabricator.wikimedia.org/T156030#3636056 (10BBlack) 05Open>03Resolved a:03BBlack Equinix was selected by the process, and we've negotiated and signed and sent in the specific order at this point. 
[16:23:41] (03CR) 10Gehel: Configuration: allow an empty aliases.yaml file (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/380773 (owner: 10Volans) [16:24:40] 10Operations, 10Traffic: Network hardware configuration for Asia Cache DC - https://phabricator.wikimedia.org/T162684#3636068 (10BBlack) [16:24:42] 10Operations, 10Traffic: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#3636069 (10BBlack) [16:24:45] 10Operations, 10Traffic: Configuration for Asia Cache DC hosts - https://phabricator.wikimedia.org/T156027#3636070 (10BBlack) [16:24:48] 10Operations, 10Traffic: Name Asia Cache DC site - https://phabricator.wikimedia.org/T156028#2962007 (10BBlack) 05Open>03Resolved a:03BBlack `eqsin` is the site name (Vendor: Equinix, Airport: [[ https://en.wikipedia.org/wiki/Singapore_Changi_Airport | SIN ]] ) [16:25:05] matthiasmullie: all done, you should be able to scap-deploy [16:27:24] 10Operations, 10ops-codfw: rack/setup/install furud.codfw.wmnet - https://phabricator.wikimedia.org/T176506#3636075 (10Papaul) [16:27:33] godog oh shoot - how do I do that? :) [16:29:08] matthiasmullie: same way as deployment-prep / beta, but from tin.eqiad.wmnet [16:31:39] godog alright - should I go ahead and do that now, or can I do it tomorrow (we've got a 3D deploy window then) [16:31:51] godog: I'm here now if you're okay deploying those webperf patches [16:32:10] matthiasmullie: let's do it now so we know it works [16:32:31] Krinkle: ack, elukey above had a question about your LANG patch [16:32:56] elukey: The HHVM patch is an optimisation from Tim. 
[16:33:05] elukey: MediaWiki sets it on every request in Setup.php
[16:33:17] Having it set by default means setlocale() will be a no-op at runtime
[16:33:32] moves the overhead from 1 per request to 1 per hhvm startup
[16:33:53] https://github.com/wikimedia/mediawiki/blob/master/includes/Setup.php#L53-L54
[16:34:08] https://gerrit.wikimedia.org/r/#/c/352861/
[16:34:16] sure sure, I got that but it seemed a bit risky so I asked since I didn't have all the context (risky == affects all the appservers)
[16:34:37] elukey: Right. We can cherry-pick it to beta cluster first perhaps
[16:34:45] I don't have a reason not to :)
[16:34:46] that would be massive
[16:35:11] massive = good?
[16:36:00] Aye, more than 20 patches cherry-picked there
[16:36:02] yep (bad italian-english)
[16:36:10] No worries :)
[16:36:21] Operations, ops-codfw: rack/setup/install furud.codfw.wmnet - https://phabricator.wikimedia.org/T176506#3636109 (Papaul) a:Papaul>RobH switch port information asw-a7-codfw xe-7/0/6 Note: I did not wire the shelves according to the diagram because for that type of wiring, I need 4x12Gb HD-min...
[16:36:43] godog: We can do the webperf patches meanwhile, I'm happy to defer the hhvm one to next week. No need to rush that one.
[16:36:57] Would be good actually to allow some time in testing for that one
[16:37:01] +1
[16:37:40] (CR) Krinkle: [C: 1] "Cherry-picked to Beta Cluster. Rescheduled prod applying to next week." [puppet] - https://gerrit.wikimedia.org/r/353228 (https://phabricator.wikimedia.org/T107128) (owner: Tim Starling)
[16:37:47] Krinkle: ack, I'll look at those
[16:38:13] godog scap deploy went well; as for running puppet on all targets, what's the best way to do that?
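Editor's sketch of the setlocale() argument made above (a process-wide default turns the per-request call into a no-op). This is a toy model in Python, not HHVM or MediaWiki code: the `Worker` class, `handle_request`, `count_switches`, and the assumption that every request starts from the process default locale are all hypothetical and exist only to illustrate the reasoning.

```python
class Worker:
    """Toy stand-in for a long-running app server process (hypothetical)."""

    def __init__(self, default_locale):
        self.default_locale = default_locale  # chosen once, at process startup
        self.locale_switches = 0              # counts switches that did real work

    def handle_request(self, wanted_locale="C.UTF-8"):
        # Assumption: every request starts from the process default locale.
        current = self.default_locale
        if current != wanted_locale:
            # This branch models the per-request setlocale() overhead.
            current = wanted_locale
            self.locale_switches += 1
        return current


def count_switches(default_locale, requests=1000):
    """Serve `requests` requests and report how many needed a real switch."""
    worker = Worker(default_locale)
    for _ in range(requests):
        worker.handle_request()
    return worker.locale_switches


if __name__ == "__main__":
    print("default C:      ", count_switches("C"))
    print("default C.UTF-8:", count_switches("C.UTF-8"))
```

Under these assumptions, a default of C pays for one switch per request, while a default of C.UTF-8 pays for none, which is the trade described in the chat.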
[16:38:24] I read about cumin, but not sure exactly how that works :p
[16:38:33] (PS6) Elukey: role::prometheus::ops: add kafka metrics [puppet] - https://gerrit.wikimedia.org/r/380744 (https://phabricator.wikimedia.org/T175922)
[16:38:49] matthiasmullie: I think that Filippo already took care of that part for you :)
[16:39:34] (or if you need another puppet run after the deploy, it will be done during the next hour automatically on all hosts)
[16:39:38] matthiasmullie: yeah what elukey said, that's done already
[16:40:03] matthiasmullie: ok so if scap deploy works that's great! all set
[16:40:27] oh alright, perfect :)
[16:40:38] thanks!
[16:42:05] (PS6) Filippo Giunchedi: webperf: Limit by-country navtiming breakdown to those with 5+ hits/min [puppet] - https://gerrit.wikimedia.org/r/377806 (https://phabricator.wikimedia.org/T166390) (owner: Krinkle)
[16:43:07] Krinkle: I'm merging all but https://gerrit.wikimedia.org/r/#/c/379830/ which I think should get a +1 from _joe_ since he recently refactored tox/rake/etc
[16:43:25] (CR) Filippo Giunchedi: [C: 2] webperf: Limit by-country navtiming breakdown to those with 5+ hits/min [puppet] - https://gerrit.wikimedia.org/r/377806 (https://phabricator.wikimedia.org/T166390) (owner: Krinkle)
[16:43:34] godog: He reviewed it yesterday in _security. Helped me to make it work.
[16:44:10] Krinkle: ah, I'll merge that too then [16:44:57] (03CR) 10Bearloga: [C: 031] Remove reportupdater job that triggers abandoned discovery-stats [puppet] - 10https://gerrit.wikimedia.org/r/380778 (https://phabricator.wikimedia.org/T176639) (owner: 10Mforns) [16:45:24] (03CR) 10Filippo Giunchedi: [C: 032] webperf: Fix crash when event contains browser_major:null [puppet] - 10https://gerrit.wikimedia.org/r/379820 (https://phabricator.wikimedia.org/T176149) (owner: 10Krinkle) [16:46:12] (03PS11) 10Filippo Giunchedi: webperf: Add navtiming tests to puppet.git:/tox.ini [puppet] - 10https://gerrit.wikimedia.org/r/379830 (owner: 10Krinkle) [16:47:27] (03CR) 10Filippo Giunchedi: [C: 032] webperf: Add navtiming tests to puppet.git:/tox.ini [puppet] - 10https://gerrit.wikimedia.org/r/379830 (owner: 10Krinkle) [16:48:02] (03PS10) 10Filippo Giunchedi: webperf: Fix crash when event contains browser_major:null [puppet] - 10https://gerrit.wikimedia.org/r/379820 (https://phabricator.wikimedia.org/T176149) (owner: 10Krinkle) [16:49:50] Krinkle: all merged, ok to force puppet or you'll take a look too? [16:50:43] (03PS1) 10Bearloga: profile::discovery_dashboards: Add daily forecasts dashboard [puppet] - 10https://gerrit.wikimedia.org/r/380786 (https://phabricator.wikimedia.org/T112170) [16:59:15] Krinkle: I'm off but iirc you are on perf-roots anyways [16:59:26] Yep [16:59:43] godog: Yeah, I'll let puppet do its run and monitor from there. [16:59:48] Thanks! [17:00:05] gwicke, cscott, arlolra, subbu, halfak, and Amir1: #bothumor I � Unicode. All rise for Services – Graphoid / Parsoid / OCG / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170926T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. 
[17:00:15] no parsoid deploy today [17:04:39] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 [17:07:27] (03CR) 10Merlijn van Deen: [C: 031] "I suppose that afterwards there should be a mass-reinstall of the misctools package to re-replace the sql script?" [puppet] - 10https://gerrit.wikimedia.org/r/380685 (https://phabricator.wikimedia.org/T176688) (owner: 10BryanDavis) [17:07:39] 10Operations, 10Ops-Access-Requests, 10Research: Server access for Miriam Redi - https://phabricator.wikimedia.org/T176682#3636194 (10herron) a:03herron Hi @Miriam! Once you have provided your dedicated prod ssh key I will prep a patch to grant the requested access. Normally there is a 3 day waiting peri... [17:07:39] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [17:09:07] (03PS1) 10Chad: group0 to 1.31.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380789 [17:12:10] !log demon@tin Started scap: 1.31.0-wmf.1 bootstrap + testwiki [17:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:29] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 [17:15:30] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [17:27:39] PROBLEM - puppet last run on mw2140 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[apache2] [17:33:27] 10Operations, 10ops-esams, 10DC-Ops, 10netops, 10procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3636236 (10RobH) @ayounsi fixed the cr3, and we have a serials from it: JN1263160AF, ACRH4681, and CAHX6845. 
Updated task description to remove t... [17:33:55] 10Operations, 10ops-esams, 10DC-Ops, 10netops, 10procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3636238 (10RobH) [17:35:06] 10Operations, 10DBA, 10Availability (Multiple-active-datacenters), 10Performance-Team (Radar): Make client certs available for apache/maintenance hosts for TLS connections to mariadb - https://phabricator.wikimedia.org/T175672#3636240 (10jcrespo) I can connect using python just by pointing to the CA cert:... [17:36:25] (03Draft2) 10Jayprakash12345: Enable ArticlePlaceholder on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380793 (https://phabricator.wikimedia.org/T176771) [17:37:40] 10Operations, 10ops-esams, 10DC-Ops, 10netops, 10procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3636242 (10faidon) @mark confirmed that S/N: TA3717090152 and S/N: TA3717090331 are the new QFX5100 that were delivered at esams a few weeks ago (... [17:39:36] (03CR) 10Jayprakash12345: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380793 (https://phabricator.wikimedia.org/T176771) (owner: 10Jayprakash12345) [17:45:12] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: Move eqiad frack to new infra - https://phabricator.wikimedia.org/T174218#3636251 (10ayounsi) == Failover tests== Rate: 1 ping per second |Tested item|ping from bast1001 to tellurium|ping from external to pfw3|ping from rigel to tellurium|Notes| |cr1 – pfw3a... 
[17:46:37] (03PS10) 10ArielGlenn: Move dataset nfs server manifests to dump module [puppet] - 10https://gerrit.wikimedia.org/r/380721 (https://phabricator.wikimedia.org/T175528) [17:48:31] (03CR) 10ArielGlenn: [C: 032] Move dataset nfs server manifests to dump module [puppet] - 10https://gerrit.wikimedia.org/r/380721 (https://phabricator.wikimedia.org/T175528) (owner: 10ArielGlenn) [17:48:59] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 [17:49:59] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [17:50:49] (03CR) 10Phedenskog: "Looks ok but needs your eyes too Krinkle." [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [17:52:11] (03PS2) 10Volans: Configuration: do not raise on empty configuration [software/cumin] - 10https://gerrit.wikimedia.org/r/380773 [17:52:12] (03PS1) 10Volans: Exceptions: convert remaining spuriours ones [software/cumin] - 10https://gerrit.wikimedia.org/r/380798 [17:52:22] (03CR) 10Phedenskog: "And needs to get rebased, let me try that tonight when the kids are asleep." 
[puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [17:53:03] (03CR) 10Volans: "reply inline" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/380773 (owner: 10Volans) [17:55:41] (03CR) 10Volans: [C: 032] OpenStack backend: set default query params in config [software/cumin] - 10https://gerrit.wikimedia.org/r/380760 (https://phabricator.wikimedia.org/T176314) (owner: 10Volans) [17:56:00] RECOVERY - puppet last run on mw2140 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:57:56] (03Merged) 10jenkins-bot: OpenStack backend: set default query params in config [software/cumin] - 10https://gerrit.wikimedia.org/r/380760 (https://phabricator.wikimedia.org/T176314) (owner: 10Volans) [18:00:06] !log demon@tin Finished scap: 1.31.0-wmf.1 bootstrap + testwiki (duration: 47m 56s) [18:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:49] 10Operations, 10Ops-Access-Requests: Requesting access to pingback data for cicalese - https://phabricator.wikimedia.org/T176749#3635454 (10herron) Hi @CCicalese_WMF! Could you please clarify if the group needed is `statistics-privatedata-users`? The group `analytics-privatedata-users` does not appear to be... 
[18:03:11] (03CR) 10BryanDavis: "> I suppose that afterwards there should be a mass-reinstall of the" [puppet] - 10https://gerrit.wikimedia.org/r/380685 (https://phabricator.wikimedia.org/T176688) (owner: 10BryanDavis) [18:03:43] (03PS1) 10Ottomata: Improvements to druid profiles, move druid role out of analytics_cluster [puppet] - 10https://gerrit.wikimedia.org/r/380800 (https://phabricator.wikimedia.org/T176223) [18:04:06] (03CR) 10jerkins-bot: [V: 04-1] Improvements to druid profiles, move druid role out of analytics_cluster [puppet] - 10https://gerrit.wikimedia.org/r/380800 (https://phabricator.wikimedia.org/T176223) (owner: 10Ottomata) [18:05:11] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops, 10procurement: eqiad: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176338#3636284 (10RobH) >>! In T176338#3624714, @Cmjohnson wrote: > @robh I want to confirm that I do have 2 spares still in their original packaging in... [18:06:36] 10Operations, 10ops-esams, 10DC-Ops, 10netops, 10procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3636287 (10RobH) [18:07:20] 10Operations, 10ops-esams, 10DC-Ops, 10netops, 10procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3621935 (10RobH) [18:07:27] 10Operations, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Backlog): Reimage gerrit2001 as stretch - https://phabricator.wikimedia.org/T168562#3636290 (10Paladox) I guess we can close this as resolved now? [18:08:35] 10Operations, 10Gerrit, 10Release-Engineering-Team: Reimage cobalt as stretch - https://phabricator.wikimedia.org/T176774#3636304 (10Paladox) [18:09:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: Move eqiad frack to new infra - https://phabricator.wikimedia.org/T174218#3636318 (10ayounsi) [18:10:28] (03PS1) 10Ottomata: [WIP] Set up separate druid public-eqiad cluster. 
[puppet] - 10https://gerrit.wikimedia.org/r/380804 (https://phabricator.wikimedia.org/T176223) [18:11:00] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Set up separate druid public-eqiad cluster. [puppet] - 10https://gerrit.wikimedia.org/r/380804 (https://phabricator.wikimedia.org/T176223) (owner: 10Ottomata) [18:11:10] (03PS18) 10Paladox: Gerrit: Use systemd::service for systemd [puppet] - 10https://gerrit.wikimedia.org/r/378768 (https://phabricator.wikimedia.org/T157414) [18:19:00] (03PS1) 10Ayounsi: Remove old pfw-eqiad from DNS [dns] - 10https://gerrit.wikimedia.org/r/380806 [18:20:39] (03CR) 10Ayounsi: [C: 032] Remove old pfw-eqiad from DNS [dns] - 10https://gerrit.wikimedia.org/r/380806 (owner: 10Ayounsi) [18:22:02] 10Operations, 10ops-esams, 10DC-Ops, 10netops, 10procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3636354 (10RobH) [18:23:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops, 10procurement: eqiad: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176338#3636355 (10RobH) [18:23:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: Move eqiad frack to new infra - https://phabricator.wikimedia.org/T174218#3636356 (10ayounsi) [18:24:11] 10Operations, 10Gerrit, 10Release-Engineering-Team: Reimage cobalt as stretch - https://phabricator.wikimedia.org/T176774#3636363 (10Dzahn) Not wrong, just too early for that. gerrit2001 isn't done yet, it's in the middle of the process. Yes to the renaming part... later. 
[18:24:29] Operations, Gerrit, Release-Engineering-Team: Reimage cobalt as stretch - https://phabricator.wikimedia.org/T176774#3636364 (Dzahn) p:Triage>Lowest
[18:27:50] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[18:28:40] mutante ^^
[18:28:40] :)
[18:30:58] Operations, Gerrit, Patch-For-Review, Release-Engineering-Team (Backlog): Reimage gerrit2001 as stretch - https://phabricator.wikimedia.org/T168562#3636377 (Dzahn) Depends how you define it. If it's only about OS installation and applying the puppet roles without errors, yes. But Gerrit the serv...
[18:32:02] paladox: recovery is good, but why was it broken? puppet run was ok yesterday
[18:32:24] or is that from ongoing work to initialize it right now
[18:33:10] (PS1) ArielGlenn: Remove last of dataset module and use dumps module manifests instead [puppet] - https://gerrit.wikimedia.org/r/380810 (https://phabricator.wikimedia.org/T175528)
[18:33:49] PROBLEM - Router interfaces on pfw3-eqiad is CRITICAL: CRITICAL: host 208.80.154.219, interfaces up: 70, down: 1, dormant: 0, excluded: 1, unused: 0
[18:38:23] mutante, paladox: https://phabricator.wikimedia.org/P6046
[18:38:32] Likely culprit here.
[18:39:10] oh? "Unable to determine SqlDialect"?
[18:39:13] i am guessing it can't connect to the db. Could it be we need to add its IP to the db server to allow it to connect
[18:39:52] Something re: ferm seems more likely. It's not a "can't auth" error, it /looks/ like it can't even complete a connection
[18:40:05] The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
[18:40:10] ^^ major hint
[18:40:37] (PS2) Ottomata: Improvements to druid profiles, move druid role out of analytics_cluster [puppet] - https://gerrit.wikimedia.org/r/380800 (https://phabricator.wikimedia.org/T176223)
[18:40:46] (PS1) Cmjohnson: Testing new mgmt frack entries for frdb1002/3 [dns] - https://gerrit.wikimedia.org/r/380813
[18:40:47] mutante: Initial puppet failure / recovery was me, btw
[18:41:05] (CR) jerkins-bot: [V: -1] Improvements to druid profiles, move druid role out of analytics_cluster [puppet] - https://gerrit.wikimedia.org/r/380800 (https://phabricator.wikimedia.org/T176223) (owner: Ottomata)
[18:41:13] no_justification: gotcha!
[18:41:19] eh, which one was it, m3-master.codfw
[18:41:24] (CR) Cmjohnson: [C: 2] Testing new mgmt frack entries for frdb1002/3 [dns] - https://gerrit.wikimedia.org/r/380813 (owner: Cmjohnson)
[18:41:26] looks
[18:41:33] that's the one for phab
[18:41:41] m2-master.codfw.wmnet
[18:41:50] Is what I've got in gerrit.config
[18:42:07] i see, that is what is in hiera, yea
[18:42:16] and also has otrs and misc stuff on it
[18:42:35] I can ping m3 from gerrit2001
[18:42:49] mysql client isn't installed..
[18:42:53] what about telnet?
[18:42:58] telnet m2-master.eqiad.wmnet 3306
[18:43:12] yea, that looks like firewalled
[18:43:15] telnet works
[18:43:21] eh, not for me
[18:43:21] (PS1) Cmjohnson: Revert "Testing new mgmt frack entries for frdb1002/3" [dns] - https://gerrit.wikimedia.org/r/380814
[18:43:25] (PS23) Phedenskog: Make values stackable [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902)
[18:43:25] Trying 10.64.0.166...
[18:43:25] Connected to dbproxy1002.eqiad.wmnet.
[18:43:28] bla bla bla
[18:43:33] that's eqiad
[18:43:38] but we should use codfw
[18:43:44] Derp, copy+pasta
[18:43:53] telnet m2-master.codfw.wmnet 3306
[18:43:53] Trying 10.192.0.14...
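Editor's sketch of the telnet check used above, for hosts where neither telnet nor a mysql client is installed. It is a minimal illustration using only the Python standard library; the function name `probe_tcp` and its return strings are hypothetical. As a rough rule of thumb (an assumption, not something stated in the chat), a timeout usually suggests packets are being dropped by a firewall, while a refusal usually means the host answered but nothing accepted the connection.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Attempt one TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The remote end actively rejected the connection attempt.
        return "refused"
    except socket.timeout:
        # No answer at all within `timeout` seconds.
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar problems.
        return f"error: {exc}"


# Example call, mirroring the telnet command in the log:
#   probe_tcp("m2-master.codfw.wmnet", 3306)
```

This only tests whether a TCP handshake completes; like telnet, it says nothing about MySQL grants or authentication.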
[18:43:54] Yeap, you're right
[18:43:54] (CR) jerkins-bot: [V: -1] Make values stackable [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: Phedenskog)
[18:43:56] So ferm
[18:44:00] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:44:03] Hmm, wonder why this hadn't bitten us before
[18:44:26] maybe it became stricter
[18:44:55] (PS2) Ottomata: [WIP] Set up separate druid public-eqiad cluster. [puppet] - https://gerrit.wikimedia.org/r/380804 (https://phabricator.wikimedia.org/T176223)
[18:45:41] Operations, Ops-Access-Requests: Requesting access to pingback data for cicalese - https://phabricator.wikimedia.org/T176749#3636435 (cicalese) I see now from [0] that the EventLogging data, which is what I need access to, is available from both `stat1005` and `stat1006`. From [1], it appears that I need...
[18:45:43] (PS24) Phedenskog: Make values stackable [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902)
[18:46:03] aha
[18:46:04] https://github.com/wikimedia/puppet/blob/bb021078cbc29d8ab5cdd6c61a2a2b269632ebdf/modules/role/templates/mariadb/grants/production-m2.sql.erb#L47
[18:46:10] mutante no_justification ^^
[18:46:13] (CR) jerkins-bot: [V: -1] Make values stackable [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: Phedenskog)
[18:46:21] that's for dbproxy1002
[18:46:33] paladox: it doesn't look like mysql grants though, i can't even connect
[18:46:39] (CR) Cmjohnson: [C: 2] Revert "Testing new mgmt frack entries for frdb1002/3" [dns] - https://gerrit.wikimedia.org/r/380814 (owner: Cmjohnson)
[18:46:41] oh
[18:46:42] in that case i would expect to be told i'm rejected
[18:47:06] ferm on the db server though would explain it
[18:47:33] (PS25) Phedenskog: Make values stackable [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902)
[18:47:35] (PS2) ArielGlenn: Remove last of dataset module and use dumps module manifests instead [puppet] - https://gerrit.wikimedia.org/r/380810 (https://phabricator.wikimedia.org/T175528)
[18:49:24] Operations, Ops-Access-Requests: Requesting access to pingback data for cicalese - https://phabricator.wikimedia.org/T176749#3635454 (Ottomata) To access eventlogging MySQL DBs, you do not need to be in the `analytics-privatedata-users` group. https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Ac...
[18:49:28] (CR) Subramanya Sastry: "Let us get rid of fawiki update from this patch so we can proceed with the other two this week." [mediawiki-config] - https://gerrit.wikimedia.org/r/379966 (https://phabricator.wikimedia.org/T175971) (owner: Zoranzoki21)
[18:49:34] (CR) ArielGlenn: [C: 2] Remove last of dataset module and use dumps module manifests instead [puppet] - https://gerrit.wikimedia.org/r/380810 (https://phabricator.wikimedia.org/T175528) (owner: ArielGlenn)
[18:50:16] yea, so.. it's allowed for all internal IPs
[18:50:24] but gerrit2001 is public
[18:50:26] that's why
[18:51:35] mutante what about eqiad?
[18:51:44] oh
[18:51:45] i see
[18:51:58] how is it set for the eqiad one?
[18:52:38] differently :) it uses dbproxy1002 there
[18:52:45] (CR) Hoo man: [C: 1] "Fine to deploy at any time." [mediawiki-config] - https://gerrit.wikimedia.org/r/380793 (https://phabricator.wikimedia.org/T176771) (owner: Jayprakash12345)
[18:52:56] oh
[19:00:04] no_justification: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170926T1900).
[19:00:04] No GERRIT patches in the queue for this window AFAICS.
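Editor's sketch of the conclusion reached above ("allowed for all internal IPs, but gerrit2001 is public"). It models a source-address allowlist with Python's standard `ipaddress` module. Everything here is an assumption for illustration: the real ferm rules are not shown in the log, 10.0.0.0/8 merely stands in for "internal", and 198.51.100.7 is a documentation-range placeholder, not gerrit2001's actual address.

```python
import ipaddress

# Assumption: 10.0.0.0/8 stands in for "all internal IPs".
INTERNAL_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]


def is_allowed(source_ip, allowed=INTERNAL_NETWORKS):
    """Return True if source_ip falls inside any allowed network."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in allowed)


if __name__ == "__main__":
    # An internal-style address passes the rule.
    print("10.192.0.14 ->", is_allowed("10.192.0.14"))
    # Placeholder public address (hypothetical) is rejected.
    print("198.51.100.7 ->", is_allowed("198.51.100.7"))
```

A client on a public address would need its own entry in the allowlist, which matches the direction of the "allow connections from gerrit servers" patch title that appears later in the log.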
[19:00:11] (03PS3) 10Ottomata: Improvements to druid profiles, move druid role out of analytics_cluster [puppet] - 10https://gerrit.wikimedia.org/r/380800 (https://phabricator.wikimedia.org/T176223) [19:00:37] (03CR) 10jerkins-bot: [V: 04-1] Improvements to druid profiles, move druid role out of analytics_cluster [puppet] - 10https://gerrit.wikimedia.org/r/380800 (https://phabricator.wikimedia.org/T176223) (owner: 10Ottomata) [19:02:24] (03PS4) 10Ottomata: Improvements to druid profiles, move druid role out of analytics_cluster [puppet] - 10https://gerrit.wikimedia.org/r/380800 (https://phabricator.wikimedia.org/T176223) [19:05:53] (03PS10) 10Zoranzoki21: Enable RemexHTML on wikitech and eswikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379966 (https://phabricator.wikimedia.org/T175971) [19:10:22] (03CR) 10Subramanya Sastry: [C: 031] Enable RemexHTML on wikitech and eswikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379966 (https://phabricator.wikimedia.org/T175971) (owner: 10Zoranzoki21) [19:11:02] (03CR) 10Chad: [C: 032] group0 to 1.31.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380789 (owner: 10Chad) [19:11:11] (03CR) 10Subramanya Sastry: [C: 04-1] Enable RemexHTML on wikitech and eswikiversity (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379966 (https://phabricator.wikimedia.org/T175971) (owner: 10Zoranzoki21) [19:13:32] (03PS11) 10Zoranzoki21: Enable RemexHTML on wikitech and eswikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379966 (https://phabricator.wikimedia.org/T175971) [19:13:48] (03PS5) 10Ottomata: Improvements to druid profiles, move druid role out of analytics_cluster [puppet] - 10https://gerrit.wikimedia.org/r/380800 (https://phabricator.wikimedia.org/T176223) [19:14:08] (03CR) 10Zoranzoki21: Enable RemexHTML on wikitech and eswikiversity (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379966 (https://phabricator.wikimedia.org/T175971) 
(owner: Zoranzoki21) [19:14:13] (CR) jerkins-bot: [V: -1] Improvements to druid profiles, move druid role out of analytics_cluster [puppet] - https://gerrit.wikimedia.org/r/380800 (https://phabricator.wikimedia.org/T176223) (owner: Ottomata) [19:14:42] (Merged) jenkins-bot: group0 to 1.31.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/380789 (owner: Chad) [19:15:21] (PS12) Zoranzoki21: Enable RemexHTML on wikitech and eswikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/379966 (https://phabricator.wikimedia.org/T175971) [19:16:01] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.31.0-wmf.1 [19:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:18] (CR) jenkins-bot: group0 to 1.31.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/380789 (owner: Chad) [19:16:22] (CR) Subramanya Sastry: [C: 1] Enable RemexHTML on wikitech and eswikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/379966 (https://phabricator.wikimedia.org/T175971) (owner: Zoranzoki21) [19:17:18] (CR) Krinkle: Make values stackable (1 comment) [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: Phedenskog) [19:19:44] (CR) Zoranzoki21: "How jenkins-bot changed from +1 on +2?" [mediawiki-config] - https://gerrit.wikimedia.org/r/379966 (https://phabricator.wikimedia.org/T175971) (owner: Zoranzoki21) [19:20:29] (CR) Krinkle: Make values stackable (1 comment) [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: Phedenskog) [19:20:46] (PS3) Ottomata: [WIP] Set up separate druid public-eqiad cluster.
[puppet] - https://gerrit.wikimedia.org/r/380804 (https://phabricator.wikimedia.org/T176223) [19:21:28] (PS6) Ottomata: Improvements to druid profiles, move druid role out of analytics_cluster [puppet] - https://gerrit.wikimedia.org/r/380800 (https://phabricator.wikimedia.org/T176223) [19:23:20] (PS4) Ottomata: [WIP] Set up separate druid public-eqiad cluster. [puppet] - https://gerrit.wikimedia.org/r/380804 (https://phabricator.wikimedia.org/T176223) [19:27:14] (PS2) Andrew Bogott: validatelabsfqdn.py: Make check case-insensitive [puppet] - https://gerrit.wikimedia.org/r/380761 [19:28:35] (PS1) Chad: Gerrit: Update known_hosts with newly reprovisioned gerrit2001 [puppet] - https://gerrit.wikimedia.org/r/380824 [19:28:37] (CR) Andrew Bogott: [C: 2] validatelabsfqdn.py: Make check case-insensitive [puppet] - https://gerrit.wikimedia.org/r/380761 (owner: Andrew Bogott) [19:28:51] mutante: ^^^ [19:30:24] Operations, Gerrit, Patch-For-Review, Release-Engineering-Team (Backlog): Reimage gerrit2001 as stretch - https://phabricator.wikimedia.org/T168562#3636612 (Dzahn) [19:30:26] Operations, Gerrit, Release-Engineering-Team (Next): Gerrit is failing to start gerrit-ssh on gerrit2001 - https://phabricator.wikimedia.org/T176532#3636611 (Dzahn) [19:30:58] (PS2) Dzahn: Gerrit: Update known_hosts with newly reprovisioned gerrit2001 [puppet] - https://gerrit.wikimedia.org/r/380824 (https://phabricator.wikimedia.org/T168562) (owner: Chad) [19:31:31] (PS1) Dzahn: mariadb::misc: allow connections from gerrit servers [puppet] - https://gerrit.wikimedia.org/r/380827 (https://phabricator.wikimedia.org/T168562) [19:33:13] (CR) Dzahn: [C: 2] Gerrit: Update known_hosts with newly reprovisioned gerrit2001 [puppet] - https://gerrit.wikimedia.org/r/380824 (https://phabricator.wikimedia.org/T168562) (owner: Chad) [19:33:18] (PS3) Dzahn: Gerrit: Update known_hosts with newly reprovisioned gerrit2001 [puppet] -
https://gerrit.wikimedia.org/r/380824 (https://phabricator.wikimedia.org/T168562) (owner: Chad) [19:37:27] Operations, Gerrit, Release-Engineering-Team (Next): Gerrit is failing to start gerrit-ssh on gerrit2001 - https://phabricator.wikimedia.org/T176532#3636684 (Paladox) I think we found the likly culprit (Chad) found this P6046. It's not connecting to the db due to a firewall preventing it from being a... [19:38:21] Operations, Contributors-Team, MobileFrontend, wikidiff2, and 2 others: Diff page consistently produces 503 on beta cluster on first visit - https://phabricator.wikimedia.org/T176637#3636688 (greg) ``` 17:24 < legoktm> greg-g: https://phabricator.wikimedia.org/T176637 could potentially be an is... [19:38:26] (PS7) Ottomata: Improvements to druid profiles, move druid role out of analytics_cluster [puppet] - https://gerrit.wikimedia.org/r/380800 (https://phabricator.wikimedia.org/T176223) [19:39:14] Operations, ops-codfw: rack/setup/install furud.codfw.wmnet - https://phabricator.wikimedia.org/T176506#3636691 (Papaul) The Power storage MD1400 doesn't have an entry in Racktables (HW type). Please add that in racktables. Thanks. [19:39:39] Operations, Gerrit, Release-Engineering-Team (Next): Gerrit is failing to start gerrit-ssh on gerrit2001 - https://phabricator.wikimedia.org/T176532#3636695 (Dzahn) Here is a change to add firewall rules to mariadb::misc to fix that.
https://gerrit.wikimedia.org/r/#/c/380827/ [19:40:13] (CR) Paladox: [C: 1] mariadb::misc: allow connections from gerrit servers [puppet] - https://gerrit.wikimedia.org/r/380827 (https://phabricator.wikimedia.org/T168562) (owner: Dzahn) [19:42:54] ACKNOWLEDGEMENT - Router interfaces on pfw3-eqiad is CRITICAL: CRITICAL: host 208.80.154.219, interfaces up: 70, down: 1, dormant: 0, excluded: 1, unused: 0: Ayounsi JTAc case opened [19:45:52] Operations, Contributors-Team, MobileFrontend, wikidiff2, and 2 others: Diff page consistently produces 503 on beta cluster on first visit - https://phabricator.wikimedia.org/T176637#3636703 (greg) Adding @maxsem and #tcb-team per Lego on IRC [19:48:37] (PS8) Ottomata: Improvements to druid profiles, move druid role out of analytics_cluster [puppet] - https://gerrit.wikimedia.org/r/380800 (https://phabricator.wikimedia.org/T176223) [19:49:05] (CR) jerkins-bot: [V: -1] Improvements to druid profiles, move druid role out of analytics_cluster [puppet] - https://gerrit.wikimedia.org/r/380800 (https://phabricator.wikimedia.org/T176223) (owner: Ottomata) [19:49:39] (PS9) Ottomata: Improvements to druid profiles, move druid role out of analytics_cluster [puppet] - https://gerrit.wikimedia.org/r/380800 (https://phabricator.wikimedia.org/T176223) [19:50:08] (PS2) Dzahn: mariadb::misc: allow connections from gerrit servers [puppet] - https://gerrit.wikimedia.org/r/380827 (https://phabricator.wikimedia.org/T168562) [19:51:31] (PS1) Jgreen: add mgmt.frack.eqiad.wmnet hostnames [dns] - https://gerrit.wikimedia.org/r/380840 [19:51:41] (CR) Paladox: [C: 1] mariadb::misc: allow connections from gerrit servers [puppet] - https://gerrit.wikimedia.org/r/380827 (https://phabricator.wikimedia.org/T168562) (owner: Dzahn) [19:52:59] (PS2) Jgreen: add mgmt.frack.eqiad.wmnet hostnames [dns] - https://gerrit.wikimedia.org/r/380840 (https://phabricator.wikimedia.org/T156397) [19:54:13]
(PS10) Ottomata: Improvements to druid profiles, move druid role out of analytics_cluster [puppet] - https://gerrit.wikimedia.org/r/380800 (https://phabricator.wikimedia.org/T176223) [19:54:37] (CR) jerkins-bot: [V: -1] Improvements to druid profiles, move druid role out of analytics_cluster [puppet] - https://gerrit.wikimedia.org/r/380800 (https://phabricator.wikimedia.org/T176223) (owner: Ottomata) [19:56:33] (PS11) Ottomata: Improvements to druid profiles, move druid role out of analytics_cluster [puppet] - https://gerrit.wikimedia.org/r/380800 (https://phabricator.wikimedia.org/T176223) [19:56:54] (PS12) Ottomata: Improvements to druid profiles, move druid role out of analytics_cluster [puppet] - https://gerrit.wikimedia.org/r/380800 (https://phabricator.wikimedia.org/T176223) [19:57:54] (CR) Ayounsi: [C: 1] add mgmt.frack.eqiad.wmnet hostnames [dns] - https://gerrit.wikimedia.org/r/380840 (https://phabricator.wikimedia.org/T156397) (owner: Jgreen) [20:00:29] (CR) Ottomata: "I have achieved NO-OP status! YES!" [puppet] - https://gerrit.wikimedia.org/r/380800 (https://phabricator.wikimedia.org/T176223) (owner: Ottomata) [20:01:55] (PS3) Jgreen: add mgmt.frack.eqiad.wmnet hostnames [dns] - https://gerrit.wikimedia.org/r/380840 (https://phabricator.wikimedia.org/T156397) [20:02:24] Operations, Contributors-Team, MobileFrontend, wikidiff2, and 2 others: Diff page consistently produces 503 on beta cluster on first visit - https://phabricator.wikimedia.org/T176637#3636823 (MaxSem) Stacktrace: {F9833532} [20:03:26] (PS5) Ottomata: [WIP] Set up separate druid public-eqiad cluster.
[puppet] - https://gerrit.wikimedia.org/r/380804 (https://phabricator.wikimedia.org/T176223) [20:04:56] (CR) Jgreen: [C: 2] add mgmt.frack.eqiad.wmnet hostnames [dns] - https://gerrit.wikimedia.org/r/380840 (https://phabricator.wikimedia.org/T156397) (owner: Jgreen) [20:06:34] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request access to logstash (nda group) for @framawiki - https://phabricator.wikimedia.org/T176364#3636861 (Dzahn) Reading https://wikitech.wikimedia.org/wiki/Volunteer_NDA it seems that "A comment of approval from one Wikimedia Foundation manager... [20:15:38] !log phab2001 - install Apache and MariaDB package upgrades [20:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:43] !log phab1001 (prod phab) - upgrade Apache package [20:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:05] PROBLEM - mysqld processes on db2044 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [20:21:19] PROBLEM - MariaDB Slave SQL: s4 on db2044 is CRITICAL: CRITICAL slave_sql_state could not connect [20:21:29] PROBLEM - Disk space on db2044 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error [20:21:39] PROBLEM - Check systemd state on db2044 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:21:44] checking [20:21:49] PROBLEM - MariaDB Slave IO: s4 on db2044 is CRITICAL: CRITICAL slave_io_state could not connect [20:21:52] cool [20:22:05] PROBLEM - MariaDB disk space on db2044 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error [20:22:08] storage broken [20:22:26] can you downtime it till tomorrow mutante ? [20:22:33] yea [20:22:34] * volans around [20:22:40] marostegui: need help? [20:23:16] going to depool it and create a task [20:23:49] PROBLEM - puppet last run on phab1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): Package[apache2] [20:24:00] (PS1) Marostegui: db-codfw.php: Depool db2044 [mediawiki-config] - https://gerrit.wikimedia.org/r/380846 [20:24:07] raid failed? [20:24:09] downtime scheduled for 24 hours [20:24:09] volans: ^ [20:24:35] jynus: looks so [20:24:38] double-checks phab1001 - just upgraded apache there [20:24:47] (CR) Volans: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/380846 (owner: Marostegui) [20:24:49] RECOVERY - puppet last run on phab1001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [20:25:01] https://phabricator.wikimedia.org/T174764 [20:25:02] mutante: can you downtime db2044 for me? [20:25:24] I can do it [20:25:26] marostegui: i did, all services on db2044 [20:25:31] thanks! [20:25:32] not the host itself [20:25:53] disable the disk alert too, so it doesn't go off again [20:25:57] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2044 [mediawiki-config] - https://gerrit.wikimedia.org/r/380846 (owner: Marostegui) [20:26:09] Operations, ops-codfw, DBA: db2044 HW RAID failure - https://phabricator.wikimedia.org/T174764#3637038 (jcrespo) Resolved>Open [20:26:15] all checks are in downtime, incl.
disk space, just not PING of the host [20:26:23] it doesn't matter [20:26:27] marostegui: the RAID icinga check seems wrong: OK: Slot 0: no logical drives --- Slot 0: no drives [20:26:29] if it goes up, it pages again [20:27:20] disabled notifications for all services on db2044 [20:27:22] downtime doesn't apply to failed services [20:28:04] the root is mounted in RO right now [20:28:15] /dev/sda1 on / type ext3 (ro,relatime,errors=remount-ro,stripe=384,data=ordered) [20:28:25] Operations, ops-codfw, DBA: db2044 HW RAID failure - https://phabricator.wikimedia.org/T174764#3637045 (Marostegui) Looks like this server has crashed again for the same reason: ``` [Tue Sep 26 20:17:42 2017] hpsa 0000:02:00.0: scsi 0:1:0:0: resetting logical Direct-Access HP LOGICAL VOLUM... [20:28:53] we can investigate tomorrow, nothing is going to break today :-) [20:29:09] (CR) Marostegui: [V: 2 C: 2] db-codfw.php: Depool db2044 [mediawiki-config] - https://gerrit.wikimedia.org/r/380846 (owner: Marostegui) [20:29:25] (CR) jenkins-bot: db-codfw.php: Depool db2044 [mediawiki-config] - https://gerrit.wikimedia.org/r/380846 (owner: Marostegui) [20:30:24] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2044 - T174764 (duration: 00m 50s) [20:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:30] T174764: db2044 HW RAID failure - https://phabricator.wikimedia.org/T174764 [20:30:45] ACKNOWLEDGEMENT - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures.
Failed resources (up to 3 shown): Exec[acme-setup-acme-gerrit] daniel_zahn https://phabricator.wikimedia.org/T168562 [20:32:00] !log phab1001 - restarted apache after upgrade [20:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:26] Operations, ops-codfw, DBA: db2044 HW RAID failure - https://phabricator.wikimedia.org/T174764#3637071 (Marostegui) I cannot see anything on ILO logs, last entry is from 9th Sept [20:32:46] let's investigate tomorrow [20:32:56] I am going to go to bed, jetlag is coming in [20:33:01] marostegui: ack, and let's add the icinga check to the list [20:33:05] it's not alarming [20:33:17] have some rest [20:37:57] (PS2) Dzahn: DNS: Add mgmt DNS entries for furud Bug:T176506 [dns] - https://gerrit.wikimedia.org/r/380776 (owner: Papaul) [20:39:53] (CR) Jcrespo: [C: -1] "I understand the problem, but the solution is a bit meh- all misc servers should not open its ports to publicly exposed. I would create a " [puppet] - https://gerrit.wikimedia.org/r/380827 (https://phabricator.wikimedia.org/T168562) (owner: Dzahn) [20:43:11] (CR) Dzahn: [C: 2] DNS: Add mgmt DNS entries for furud Bug:T176506 [dns] - https://gerrit.wikimedia.org/r/380776 (owner: Papaul) [20:44:51] (CR) Jcrespo: [C: -1] "For context: https://phabricator.wikimedia.org/T104699#2560210 + https://phabricator.wikimedia.org/T104699#3617885" [puppet] - https://gerrit.wikimedia.org/r/380827 (https://phabricator.wikimedia.org/T168562) (owner: Dzahn) [20:45:42] Operations, ops-codfw: rack/setup/install furud.codfw.wmnet - https://phabricator.wikimedia.org/T176506#3637132 (RobH) >>! In T176506#3636691, @Papaul wrote: > The Power storage MD1400 doesn't have an entry in Racktables (HW type). Please add that in racktables. > > Thanks. added to racktables diction...
[20:46:48] (PS26) Phedenskog: Make values stackable [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) [20:48:11] (CR) Phedenskog: Make values stackable (2 comments) [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: Phedenskog) [20:48:32] (PS27) Phedenskog: Make values stackable [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) [20:59:56] Operations, TemplateStyles, Traffic, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3637202 (dr0ptp4kt) [21:01:23] Operations, Ops-Access-Requests: Requesting access to pingback data for cicalese - https://phabricator.wikimedia.org/T176749#3637204 (CCicalese_WMF) That's fine with me, if it will do the job. [21:01:53] Operations, TemplateStyles, Traffic, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#2675091 (dr0ptp4kt) [21:09:54] Operations, Datasets-General-or-Unknown: logrotate issue on dumps hosts - https://phabricator.wikimedia.org/T176810#3637240 (Dzahn) [21:10:12] Operations, Datasets-General-or-Unknown: logrotate issue (cron spam) on dumps hosts - https://phabricator.wikimedia.org/T176810#3637253 (Dzahn) p:Triage>Low [21:17:06] Operations, Datasets-General-or-Unknown, User-ArielGlenn: logrotate issue (cron spam) on dumps hosts - https://phabricator.wikimedia.org/T176810#3637287 (ArielGlenn) [21:27:13] Operations, Gerrit, Release-Engineering-Team (Next): Gerrit is failing to start gerrit-ssh on gerrit2001 - https://phabricator.wikimedia.org/T176532#3637335 (Paladox) @Dazhn we are going to have to manually open the port per jynus for now [22:00:42] Operations, ops-esams, DC-Ops, netops: cr2-esams temperature warning -
https://phabricator.wikimedia.org/T176816#3637416 (ayounsi) [22:02:50] (PS3) Andrew Bogott: fullstack: optionally clean up leaked VMs after a point [puppet] - https://gerrit.wikimedia.org/r/379388 (https://phabricator.wikimedia.org/T167556) [22:02:52] jouncebot: now [22:02:53] No deployments scheduled for the next 0 hour(s) and 57 minute(s) [22:03:40] !log gerrit master restart, back momentarily [22:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:08] whee [22:05:28] Operations, Ops-Access-Requests: Requesting access to pingback data for cicalese - https://phabricator.wikimedia.org/T176749#3635454 (tstarling) There is research-client.cnf, accessible from the `researchers` group, and stats-research-client.cnf, which is identical, but accessible from the `stats` group.... [22:06:49] PROBLEM - puppet last run on graphite2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [22:09:15] ^ that'll recover itself [22:34:19] RECOVERY - puppet last run on graphite2001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [22:59:31] Operations, Community-Tech, MediaWiki-General-or-Unknown, Stewards-and-global-tools (Temporary-UserRights): Temporary userrights not expiring in DB tables - https://phabricator.wikimedia.org/T176754#3637625 (TTO) We could certainly think about running a cronjob for purging expired user groups. [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170926T2300). [23:00:05] RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process.
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:28] I'm here [23:02:13] I can SWAT [23:02:37] Thanks [23:14:43] RoanKattouw: rcfilters change is live on wmf.19 and wmf.1 on mwdebug1002, check please [23:14:49] Looking [23:16:08] Working [23:16:11] thcipriani: Looks good [23:16:22] ok, going live wmf.1 first [23:19:09] !log thcipriani@tin Synchronized php-1.31.0-wmf.1/resources/src/mediawiki.rcfilters/mw.rcfilters.init.js: SWAT: [[gerrit:380801|RCFilters: Log performance data]] T176652 (duration: 00m 51s) [23:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:15] T176652: Performance review of RCFilters feature - https://phabricator.wikimedia.org/T176652 [23:19:23] Operations, Community-Tech, MediaWiki-General-or-Unknown, Stewards-and-global-tools (Temporary-UserRights): Temporary userrights not expiring in DB tables - https://phabricator.wikimedia.org/T176754#3637737 (EddieGP) >>! In T176754#3637625, @TTO wrote: > We could certainly think about running a c...
[23:20:34] !log thcipriani@tin Synchronized php-1.30.0-wmf.19/resources/src/mediawiki.rcfilters/mw.rcfilters.init.js: SWAT: [[gerrit:380802|RCFilters: Log performance data]] T176652 (duration: 00m 48s) [23:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:59] Operations, Community-Tech, MediaWiki-General-or-Unknown, Stewards-and-global-tools (Temporary-UserRights): Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#3637751 (EddieGP) [23:21:58] RoanKattouw: revert MWSignatureTool case is live on mwdebug1002, check please [23:22:14] Looking [23:22:35] Looks good [23:23:22] k going live [23:25:44] !log thcipriani@tin Synchronized php-1.31.0-wmf.1/extensions/VisualEditor/modules/ve-mw/ui/tools/ve.ui.MWSignatureTool.js: SWAT: [[gerrit:380842|Revert MWSignatureTool case]] (duration: 00m 48s) [23:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:53] ^ RoanKattouw live everywhere [23:25:56] Yay thanks [23:26:13] yw :) [23:26:42] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request access to logstash (nda group) for @framawiki - https://phabricator.wikimedia.org/T176364#3637793 (Framawiki) >>! In T176364#3628693, @Dereckson wrote: > To better understand your request, could you give a sample of tasks you would like to c... [23:52:21] (CR) Chelsyx: [C: 1] profile::discovery_dashboards: Add daily forecasts dashboard [puppet] - https://gerrit.wikimedia.org/r/380786 (https://phabricator.wikimedia.org/T112170) (owner: Bearloga)